Abstract

We propose an extension of quasi-Newton methods, and investigate the convergence and the robustness properties of the proposed update formulae for the approximate Hessian matrix. Fletcher has studied a variational problem which derives the approximate Hessian update formula of the quasi-Newton methods. We point out that the variational problem is identical to optimization of the Kullback-Leibler divergence, which is a discrepancy measure between two probability distributions. Then, we introduce the Bregman divergence as an extension of the Kullback-Leibler divergence, and derive extended quasi-Newton update formulae based on the variational problem with the Bregman divergence. The proposed update formulae belong to a class of self-scaling quasi-Newton methods. We study the convergence property of the proposed quasi-Newton method. Moreover, we apply tools from robust statistics to analyze the robustness of the Hessian update formulae against the numerical rounding errors included in the line search for the step length. As a result, we find that the influence of the inexact line search is bounded only for the standard BFGS formula for the Hessian approximation. Numerical studies are conducted to verify the usefulness of the tools borrowed from robust statistics.

1 Introduction 

We consider quasi-Newton methods for the unconstrained optimization problem

$$ \min_{x \in \mathbb{R}^n} f(x), \tag{1} $$

in which the function $f : \mathbb{R}^n \to \mathbb{R}$ is twice continuously differentiable on $\mathbb{R}^n$. The quasi-Newton method is known to be one of the most successful methods for unconstrained function minimization. Details are given in [15], [13] and references therein.

The main purpose of this paper is to present an extended framework of quasi-Newton methods, and to study the robustness of quasi-Newton update formulae against numerical errors of the line search. There are two main standard quasi-Newton methods: one is the DFP formula and the other is the BFGS formula. Fletcher [7] has pointed out that the standard formulae, DFP and BFGS, are obtained as the optimal solutions of a variational problem over the set of positive definite matrices. Along this line, we extend the quasi-Newton update formula. Then, we study the robustness of the extended quasi-Newton methods, where we apply some techniques developed in the field of robust statistics [11].

We briefly introduce the quasi-Newton formulae and the associated variational result. In a quasi-Newton method, a sequence $\{x_k\}_{k=0}^{\infty} \subset \mathbb{R}^n$ is successively generated in such a manner that $x_{k+1} = x_k - \alpha_k B_k^{-1} \nabla f(x_k)$. The coefficient $\alpha_k \in \mathbb{R}$ is a step size computed by a line search, and $B_k$ is a positive definite matrix approximating the Hessian matrix $\nabla^2 f(x_k)$ at the point $x_k$. Let $s_k$ and $y_k$ be column vectors defined by

$$ s_k = x_{k+1} - x_k = -\alpha_k B_k^{-1} \nabla f(x_k), \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k). $$

We need a Hessian approximation $B_{k+1}$ for $\nabla^2 f(x_{k+1})$ to continue the computation. In the DFP method, $B_{k+1}$ is given by

$$ B_{k+1} = B^{\mathrm{DFP}}[B_k; s_k, y_k] := B_k - \frac{B_k s_k y_k^\top + y_k s_k^\top B_k}{s_k^\top y_k} + \frac{s_k^\top B_k s_k}{(s_k^\top y_k)^2}\, y_k y_k^\top + \frac{y_k y_k^\top}{s_k^\top y_k}, \tag{2} $$

and the BFGS method provides the different formula

$$ B_{k+1} = B^{\mathrm{BFGS}}[B_k; s_k, y_k] := B_k - \frac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k} + \frac{y_k y_k^\top}{s_k^\top y_k}. \tag{3} $$

When $B_k \in \mathrm{PD}(n)$ and $s_k^\top y_k > 0$ hold, both $B^{\mathrm{DFP}}[B_k; s_k, y_k]$ and $B^{\mathrm{BFGS}}[B_k; s_k, y_k]$ are also positive definite matrices. In practice, the Cholesky decomposition of $B_k$ is successively updated in order to compute the search direction $-B_k^{-1} \nabla f(x_k)$ efficiently. The idea of updating Cholesky factors was pioneered by Gill and Murray [9]. Note that the equality

$$ B^{\mathrm{DFP}}[B_k; s_k, y_k]^{-1} = B^{\mathrm{BFGS}}[B_k^{-1}; y_k, s_k] $$

holds. Hence, the update formula for the inverse $H_{k+1} = B_{k+1}^{-1}$ can be directly derived from $H_k = B_k^{-1}$ without computing a matrix inverse.
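
The two formulae and the inverse relation between them can be checked numerically. The following is a minimal NumPy sketch, not code from the paper: the function names `bfgs_update`, `dfp_update` and the random test data produced by `random_instance` are assumptions introduced only for illustration.

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update (3) of the Hessian approximation B."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

def dfp_update(B, s, y):
    """DFP update (2) of the Hessian approximation B (B is symmetric)."""
    Bs, sy = B @ s, s @ y
    return (B - (np.outer(Bs, y) + np.outer(y, Bs)) / sy
            + (s @ Bs) * np.outer(y, y) / sy**2
            + np.outer(y, y) / sy)

def random_instance(n=5, seed=0):
    """Hypothetical test data: B_k in PD(n) and a pair (s, y) with s^T y > 0."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n))
    B = A @ A.T + n * np.eye(n)                  # positive definite B_k
    s = rng.standard_normal(n)
    y = np.diag(rng.uniform(0.5, 2.0, n)) @ s    # y = G s with G positive definite
    return B, s, y

B, s, y = random_instance()
B_bfgs = bfgs_update(B, s, y)
B_dfp = dfp_update(B, s, y)
# Inverse of the DFP matrix versus BFGS applied to B^{-1} with s and y swapped.
H_from_duality = bfgs_update(np.linalg.inv(B), y, s)
```

Both updated matrices should satisfy the secant condition $B s_k = y_k$, stay positive definite, and `np.linalg.inv(B_dfp)` should agree with `H_from_duality` up to rounding.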

We introduce a variational approach in quasi-Newton methods. Let $\mathrm{PD}(n)$ be the set of all $n$ by $n$ symmetric positive definite matrices, and let the function $\psi : \mathrm{PD}(n) \to \mathbb{R}$ be the strictly convex function over $\mathrm{PD}(n)$ defined by

$$ \psi(A) = \mathrm{tr}(A) - \log\det A. $$

Fletcher [7] has shown that the DFP update formula (2) is obtained as the unique solution of the constrained optimization problem

$$ \min_{B \in \mathrm{PD}(n)} \psi\big(B_k^{1/2} B^{-1} B_k^{1/2}\big) \quad \text{subject to } B s_k = y_k, $$

where $A^{1/2}$ for $A \in \mathrm{PD}(n)$ is the matrix satisfying $A^{1/2} \in \mathrm{PD}(n)$ and $(A^{1/2})^2 = A$. The BFGS formula is also obtained as the optimal solution of

$$ \min_{B \in \mathrm{PD}(n)} \psi\big(B_k^{-1/2} B B_k^{-1/2}\big) \quad \text{subject to } B s_k = y_k, $$

in which $B_k^{-1/2}$ denotes $(B_k^{-1})^{1/2}$ or equivalently $(B_k^{1/2})^{-1}$.

It is worthwhile to point out that the function $\psi$ is identical to the Kullback-Leibler (KL) divergence [1], [12] up to an additive constant. Let $N_n(0, P)$ be the $n$-dimensional Gaussian distribution with mean zero and variance-covariance matrix $P \in \mathrm{PD}(n)$. Then the KL-divergence between $N_n(0, P)$ and $N_n(0, Q)$ is defined, up to the conventional factor $1/2$, by

$$ \mathrm{KL}(P, Q) = \mathrm{tr}(P Q^{-1}) - \log\det(P Q^{-1}) - n, $$

which is equal to $\psi(Q^{-1/2} P Q^{-1/2}) - n$. The KL-divergence is regarded as a generalization of the squared distance over the space of probability distributions. Using the KL-divergence, we can represent the update formulae as the optimal solutions of the following minimization problems:

$$ \text{(DFP)} \quad \min_{B \in \mathrm{PD}(n)} \mathrm{KL}(B_k, B) \quad \text{subject to } B s_k = y_k, \tag{4} $$

$$ \text{(BFGS)} \quad \min_{B \in \mathrm{PD}(n)} \mathrm{KL}(B, B_k) \quad \text{subject to } B s_k = y_k. \tag{5} $$

The KL-divergence is asymmetric, that is, $\mathrm{KL}(P, Q) \neq \mathrm{KL}(Q, P)$ in general. Hence the above problems provide different solutions.
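
As a numerical illustration of problems (4) and (5), the sketch below evaluates the KL-divergence for the two standard updates. It is only a sanity check under assumed random test data; the helper names `kl`, `psi` and `inv_sqrt` are hypothetical. Since both updated matrices are feasible for both problems, the BFGS matrix should give the smaller value of $\mathrm{KL}(B, B_k)$ and the DFP matrix the smaller value of $\mathrm{KL}(B_k, B)$.

```python
import numpy as np

def kl(P, Q):
    """KL(P, Q) = tr(P Q^{-1}) - log det(P Q^{-1}) - n, as defined in the text."""
    M = P @ np.linalg.inv(Q)
    return np.trace(M) - np.linalg.slogdet(M)[1] - P.shape[0]

def psi(A):
    """psi(A) = tr(A) - log det A."""
    return np.trace(A) - np.linalg.slogdet(A)[1]

def inv_sqrt(Q):
    """Symmetric inverse square root Q^{-1/2} via an eigendecomposition."""
    w, U = np.linalg.eigh(Q)
    return (U / np.sqrt(w)) @ U.T

def bfgs_update(B, s, y):
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

def dfp_update(B, s, y):
    Bs, sy = B @ s, s @ y
    return (B - (np.outer(Bs, y) + np.outer(y, Bs)) / sy
            + (s @ Bs) * np.outer(y, y) / sy**2 + np.outer(y, y) / sy)

# Hypothetical test data with s^T y > 0.
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
B_k = A @ A.T + n * np.eye(n)
s = rng.standard_normal(n)
y = np.diag(rng.uniform(0.5, 2.0, n)) @ s

B_bfgs, B_dfp = bfgs_update(B_k, s, y), dfp_update(B_k, s, y)
R = inv_sqrt(B_k)
kl_via_psi = psi(R @ B_bfgs @ R) - n   # psi(Q^{-1/2} P Q^{-1/2}) - n with P = B_bfgs, Q = B_k
```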

Here is a brief outline of the article. In Section 2 we introduce the so-called Bregman divergence, which is an extension of the KL-divergence. In Section 3 an extended quasi-Newton formula is derived based on the Bregman divergence. In Section 4 the convergence property of the proposed quasi-Newton method is studied, and Section 5 is devoted to the robustness of the Hessian update formula. Numerical simulations are presented in Section 6. We conclude with a discussion and outlook in Section 7. Some proofs of the theorems are postponed to the Appendix.

Throughout the paper, we use the following notation. The set of positive real numbers is denoted by $\mathbb{R}_+ \subset \mathbb{R}$. Let $\det A$ be the determinant of a square matrix $A$, and let $\mathrm{GL}(n)$ denote the set of $n$ by $n$ non-degenerate real matrices. The set of all $n$ by $n$ real symmetric matrices is denoted by $\mathrm{Sym}(n)$, and $\mathrm{PD}(n) \subset \mathrm{GL}(n) \cap \mathrm{Sym}(n)$ is the set of $n$ by $n$ symmetric positive definite matrices. For two square matrices $A, B$, the inner product $\langle A, B \rangle$ is defined by $\mathrm{tr}(A B^\top)$, and $\|A\|_F$ is the Frobenius norm defined as the square root of $\langle A, A \rangle$. Throughout the paper we only deal with the inner product of symmetric matrices, so the transposition in the trace is dropped. For a vector $x$, $\|x\|$ denotes the Euclidean norm. The first and second order derivatives of a function $f : \mathbb{R} \to \mathbb{R}$ are denoted by $f'$ and $f''$, respectively.

2 Bregman Divergence induced from Potential Functions

As introduced in Section 1, the update formulae of the DFP and the BFGS methods are derived from optimization problems of the KL-divergence. In this section we introduce the Bregman divergence [3], which is an extension of the KL-divergence. In particular, we focus on the Bregman divergence induced from a potential function. Then, we present extended quasi-Newton formulae derived from the variational problem for the Bregman divergence.

Let $\varphi : \mathrm{PD}(n) \to \mathbb{R}$ be a differentiable, strictly convex function that maps positive definite matrices to real numbers. We define the Bregman divergence of the matrix $P$ from the matrix $Q$ as

$$ D(P, Q) = \varphi(P) - \varphi(Q) - \langle \nabla\varphi(Q), P - Q \rangle, \tag{6} $$

where $\nabla\varphi(Q)$ is the $n$ by $n$ matrix whose $(i,j)$ element is given by $\frac{\partial \varphi}{\partial Q_{ij}}(Q)$. The strict convexity of $\varphi$ guarantees that $D(P, Q)$ is non-negative and equal to zero if and only if $P = Q$ holds. Figure 1 illustrates the relation between the function $\varphi$ and the Bregman divergence. Note that $D(P, Q)$ is convex in $P$ but not necessarily convex in $Q$. Bregman divergences have been well studied for nearness problems in the fields of statistics and machine learning
[21 El EH- 
[Figure 1. Plotted curves: $\varphi(P)$ and its tangent $\varphi(Q) + \langle \nabla\varphi(Q), P - Q \rangle$; marked gap: $D(P, Q)$; horizontal axis: $P$.]

Figure 1: The Bregman divergence defined by the strictly convex function $\varphi : \mathrm{PD}(n) \to \mathbb{R}$. Due to the strict convexity of $\varphi$, the function $\varphi(P)$ lies above its tangents $\varphi(Q) + \langle \nabla\varphi(Q), P - Q \rangle$. Hence the non-negativity of the Bregman divergence $D(P, Q)$ is guaranteed.



In this paper, we focus on the Bregman divergence induced from a potential function [17]. Let $V : \mathbb{R}_+ \to \mathbb{R}$ be a strictly convex, decreasing, and third order continuously differentiable function. For the derivative $V'$, the inequality $V' < 0$ follows from the assumption. Indeed, the assumption leads to $V' \le 0$ and $V'' \ge 0$, and if $V'(z_0) = 0$ holds for some $z_0 \in \mathbb{R}_+$, then $V'(z) = 0$ holds for all $z \ge z_0$. Hence $V$ is an affine function for $z \ge z_0$. This contradicts the strict convexity of $V$. We define the functions $\nu_V : \mathbb{R}_+ \to \mathbb{R}$ and $\beta_V : \mathbb{R}_+ \to \mathbb{R}$ by

$$ \nu_V(z) = -z V'(z), \qquad \beta_V(z) = \frac{z\, \nu_V'(z)}{\nu_V(z)}. $$

The subscript $V$ of $\nu_V$ and $\beta_V$ will be dropped if there is no confusion.

Definition 1 (potential function). Let $V : \mathbb{R}_+ \to \mathbb{R}$ be a function which is strictly convex, decreasing, and third order continuously differentiable. Suppose that the functions $\nu$ and $\beta$ defined from $V$ satisfy the following conditions:

$$ \nu(z) > 0, \tag{7} $$

$$ \beta(z) < \frac{1}{n} \tag{8} $$

for all $z > 0$, and

$$ \lim_{z \to +0} \frac{z}{\nu(z)^{n-1}} = 0. \tag{9} $$

Then, $V$ is called a potential function, or potential for short. For $P \in \mathrm{PD}(n)$, the function $V(\det P)$ is also referred to as a potential on $\mathrm{PD}(n)$.

As shown in [17], the function $V(\det P)$ is strictly convex in $P \in \mathrm{PD}(n)$ if and only if $V$ satisfies (7) and (8). The condition (9) guarantees the existence of the Hessian update formula, which is discussed in Section 3.

Given a potential function $V$, the Bregman divergence defined from the potential function $\varphi(P) = V(\det P)$ in (6) is denoted by $D_V(P, Q)$, and referred to as the $V$-Bregman divergence. The $V$-Bregman divergence has the form

$$ D_V(P, Q) = V(\det P) - V(\det Q) + \nu(\det Q) \langle P, Q^{-1} \rangle - n\, \nu(\det Q). $$

Indeed, substituting

$$ (\nabla V(\det Q))_{ij} = \frac{\partial V(\det Q)}{\partial Q_{ij}} = V'(\det Q) \frac{\partial \det Q}{\partial Q_{ij}} = -\nu(\det Q) (Q^{-1})_{ij} $$

into (6), we obtain the expression of $D_V(P, Q)$. Below we show some examples of the $V$-Bregman divergence.

Example 1. For the negative logarithmic function $V(z) = -\log(z)$, we have $\nu(z) = 1$. Then the $V$-Bregman divergence is equal to the KL-divergence,

$$ D_V(P, Q) = \mathrm{KL}(P, Q) = \langle P, Q^{-1} \rangle - \log\det(P Q^{-1}) - n. $$

Note that $\mathrm{KL}(P, Q) = \mathrm{KL}(Q^{-1}, P^{-1})$ holds. Hence, $\mathrm{KL}(P, Q)$ is convex in both $P$ and $Q^{-1}$.

Example 2. For the power potential $V(z) = (1 - z^{\gamma})/\gamma$ with $\gamma < 1/n$, we have $\nu(z) = z^{\gamma}$ and $\beta(z) = \gamma$. Then, we obtain

$$ D_V(P, Q) = (\det Q)^{\gamma} \left\{ \langle P, Q^{-1} \rangle + \frac{1 - (\det P Q^{-1})^{\gamma}}{\gamma} - n \right\}. $$

The KL-divergence is recovered by taking the limit $\gamma \to 0$.



Example 3. For $0 \le a < b$, let $V(z) = a \log(az + 1) - b \log(z)$. Then $V(z)$ is a convex and decreasing function, and we obtain

$$ \nu(z) = b - a + \frac{a}{az + 1} > 0, \qquad \beta(z) = \frac{-a^2 z}{(az + 1)(a(b - a)z + b)} \le 0 $$

for $z > 0$. The negative-log potential is obtained by setting $a = 0$, $b = 1$. This potential satisfies the inequality $0 < b - a \le \nu(z) \le b$. This bounding condition on $\nu$ will be assumed in the convergence analysis of Section 4.

We apply $V$-Bregman divergences to extend the quasi-Newton update formula.
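
The closed form of $D_V(P, Q)$ can be evaluated directly. The sketch below is an illustration under assumed random inputs, with hypothetical helper names (`v_bregman`, `power_potential`, `log_potential`); it evaluates the divergence for the power potential of Example 2 and for the negative-log potential of Example 1, so that non-negativity and the limit $\gamma \to 0$ can be inspected numerically.

```python
import numpy as np

def v_bregman(P, Q, V, nu):
    """D_V(P,Q) = V(det P) - V(det Q) + nu(det Q) <P, Q^{-1}> - n nu(det Q)."""
    n = P.shape[0]
    dP, dQ = np.linalg.det(P), np.linalg.det(Q)
    return V(dP) - V(dQ) + nu(dQ) * np.trace(P @ np.linalg.inv(Q)) - n * nu(dQ)

def power_potential(gamma):
    """Power potential of Example 2 (gamma != 0, gamma < 1/n) and its nu."""
    return (lambda z: (1.0 - z**gamma) / gamma), (lambda z: z**gamma)

def log_potential():
    """Negative-log potential of Example 1; nu is identically 1."""
    return (lambda z: -np.log(z)), (lambda z: 1.0)

def random_pd(n, rng):
    """Hypothetical helper: a well-conditioned random matrix in PD(n)."""
    A = rng.standard_normal((n, n))
    return A @ A.T / n + np.eye(n)

rng = np.random.default_rng(0)
n = 4
P, Q = random_pd(n, rng), random_pd(n, rng)

d_kl = v_bregman(P, Q, *log_potential())
d_pow = v_bregman(P, Q, *power_potential(0.1))      # 0.1 < 1/n = 0.25
d_neg = v_bregman(P, Q, *power_potential(-0.5))     # negative gamma is allowed
d_small = v_bregman(P, Q, *power_potential(1e-8))   # close to the KL limit
```

Here `d_small` should be close to `d_kl`, and every value should be non-negative, with `v_bregman(P, P, ...)` equal to zero up to rounding.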



3 Extended quasi-Newton update formula 

To extend the standard quasi-Newton methods, we consider the optimization problem of the $V$-Bregman divergence instead of the KL-divergence. Let us define the $V$-BFGS formula as the optimal solution of the problem

$$ \text{($V$-BFGS)} \quad \min_{B \in \mathrm{PD}(n)} D_V(B, B_k) \quad \text{subject to } B s_k = y_k. \tag{10} $$

Next we define the $V$-DFP update formula, which is an extension of the standard DFP formula (2). Note that the KL-divergence satisfies $\mathrm{KL}(P, Q) = \mathrm{KL}(Q^{-1}, P^{-1})$. Then, the optimization problem (4) associated with the DFP update formula can be extended to the problem

$$ \text{($V$-DFP)} \quad \min_{B \in \mathrm{PD}(n)} D_V(B^{-1}, B_k^{-1}) \quad \text{subject to } B s_k = y_k. \tag{11} $$

The problem (11) is convex in $B^{-1}$, since the objective function $D_V(B^{-1}, B_k^{-1})$ is convex in $B^{-1}$ and the constraint $s_k = B^{-1} y_k$ is affine in $B^{-1}$. We mainly consider the $V$-BFGS update formula. The argument for the $V$-DFP update is almost the same.

Theorem 1. Let $B_k \in \mathrm{PD}(n)$, and suppose $s_k^\top y_k > 0$. Then the problem (10) has the unique optimal solution $B_{k+1} \in \mathrm{PD}(n)$ satisfying

$$ B_{k+1} = \frac{\nu(\det B_{k+1})}{\nu(\det B_k)} B^{\mathrm{BFGS}}[B_k; s_k, y_k] + \left( 1 - \frac{\nu(\det B_{k+1})}{\nu(\det B_k)} \right) \frac{y_k y_k^\top}{s_k^\top y_k}. \tag{12} $$

The proof is found in Appendix A.

Note that the $V$-BFGS update formula is represented by the affine sum of
$B^{\mathrm{BFGS}}[B_k; s_k, y_k]$ and $y_k y_k^\top / s_k^\top y_k$. This form is equivalent to the self-scaling
quasi-Newton update [181 CE] defined as 

$$ B_{k+1} = \theta_k B^{\mathrm{BFGS}}[B_k; s_k, y_k] + (1 - \theta_k) \frac{y_k y_k^\top}{s_k^\top y_k}, \tag{13} $$

where $\theta_k$ is a positive real number. In the $V$-BFGS update formula, the coefficient $\theta_k$ is determined from the function $\nu$. The inverse of the matrix (13) is given by

$$ B_{k+1}^{-1} = \frac{1}{\theta_k} \big( B^{\mathrm{BFGS}}[B_k; s_k, y_k] \big)^{-1} + \left( 1 - \frac{1}{\theta_k} \right) \frac{s_k s_k^\top}{s_k^\top y_k}. \tag{14} $$

As a result, for any $\theta_k > 0$, the matrix $B_{k+1}$ in (13) is positive definite. Indeed, for $0 < \theta_k \le 1$ the expression (13) guarantees the positive definiteness of $B_{k+1}$, and for $1 < \theta_k$, the expression (14) implies $B_{k+1} \in \mathrm{PD}(n)$. Therefore $B_{k+1}$ in (12) is also a positive definite matrix, since any potential $V$ satisfies $\nu_V > 0$.

In the self-scaling update formula (13), the choice

$$ \theta_k = \frac{s_k^\top y_k}{s_k^\top B_k s_k} \tag{15} $$

is often recommended. As analyzed in [16], however, the self-scaling method with an inexact line search for the step length tends to be less efficient than the standard BFGS method. Following Example 4 below, we prove that the self-scaling method with the scaling parameter (15) is not derived from the $V$-Bregman divergence.

We present a practical way of computing the Hessian approximation (12). In Eq. (12), the optimal solution $B_{k+1}$ appears on both sides, that is, we have only an implicit expression for $B_{k+1}$. The numerical computation is, however, performed as efficiently as the standard BFGS update. To compute $B_{k+1}$, we first compute $\det B_{k+1}$. Taking the determinant of both sides of (12) leads to

$$ \det B_{k+1} = \frac{\det\big( B^{\mathrm{BFGS}}[B_k; s_k, y_k] \big)}{\nu(\det B_k)^{n-1}}\, \nu(\det B_{k+1})^{n-1}. \tag{16} $$

Hence, by solving the nonlinear equation

$$ z = \frac{\det\big( B^{\mathrm{BFGS}}[B_k; s_k, y_k] \big)}{\nu(\det B_k)^{n-1}}\, \nu(z)^{n-1}, \qquad z > 0, $$

we can find $\det B_{k+1}$. As shown in the proof of Theorem 1, the function $z / \nu(z)^{n-1}$ is monotone increasing. Hence Newton's method is available to find the root of the above equation efficiently. Once we obtain the value of $\det B_{k+1}$, we can compute the Hessian approximation $B_{k+1}$ by substituting $\det B_{k+1}$ into Eq. (12). Figure 2 shows the update algorithm of the $V$-BFGS formula, which exploits the Cholesky decomposition of the approximate Hessian matrix. By maintaining the Cholesky decomposition, we can easily compute the determinant and the search direction. In the algorithm of Figure 2 we require the Wolfe condition [15, Section 3.1] for the step length $\alpha_k$. As shown in Section 4, the Wolfe condition is useful for establishing the convergence property of the optimization algorithm.
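
The two-stage computation described above (first the determinant, then the matrix) can be sketched as follows. This is an illustrative sketch, not the paper's implementation: it uses plain bisection on the increasing function $z / \nu(z)^{n-1}$ instead of Newton's method, it calls `np.linalg.det` directly rather than maintaining Cholesky factors, and the names `v_bfgs_update` and `nu_example3` as well as the test data are assumptions. The potential is the one of Example 3 with the hypothetical choice $a = 1$, $b = 2$.

```python
import numpy as np

def bfgs_update(B, s, y):
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

def nu_example3(z, a=1.0, b=2.0):
    """nu(z) = b - a + a / (a z + 1) for the potential of Example 3."""
    return b - a + a / (a * z + 1.0)

def v_bfgs_update(B, s, y, nu, tol=1e-12, max_iter=200):
    """Return (B_{k+1}, det B_{k+1}) for the implicit formula (12)."""
    n = B.shape[0]
    B_bfgs = bfgs_update(B, s, y)
    nu_k = nu(np.linalg.det(B))
    C = np.linalg.det(B_bfgs) / nu_k ** (n - 1)
    g = lambda z: z / nu(z) ** (n - 1) - C       # increasing, root is det B_{k+1}
    lo = hi = np.linalg.det(B_bfgs)
    while g(lo) > 0:                             # expand the bracket downwards
        lo /= 2.0
    while g(hi) < 0:                             # expand the bracket upwards
        hi *= 2.0
    for _ in range(max_iter):                    # bisection on [lo, hi]
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * hi:
            break
    z = 0.5 * (lo + hi)
    theta = nu(z) / nu_k                         # scaling coefficient in (12)
    return theta * B_bfgs + (1.0 - theta) * np.outer(y, y) / (s @ y), z

# Hypothetical test data with s^T y > 0.
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
B_k = A @ A.T + n * np.eye(n)
s = rng.standard_normal(n)
y = np.diag(rng.uniform(0.5, 2.0, n)) @ s

B_next, z_star = v_bfgs_update(B_k, s, y, nu_example3)
```

If the root has been found accurately, `np.linalg.det(B_next)` should reproduce `z_star`, which is exactly the self-consistency required by Eq. (12).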

In the same way as in the proof of Theorem 1, we obtain the $V$-DFP update formula defined from (11) as

$$ B_{k+1} = \frac{\nu((\det B_k)^{-1})}{\nu((\det B_{k+1})^{-1})} B^{\mathrm{DFP}}[B_k; s_k, y_k] + \left( 1 - \frac{\nu((\det B_k)^{-1})}{\nu((\det B_{k+1})^{-1})} \right) \frac{y_k y_k^\top}{s_k^\top y_k}. \tag{17} $$

It is straightforward to unify the $V$-BFGS method and the $V$-DFP method in the same way as the standard Broyden family [3]. Let $B^{\mathrm{BFGS}}_{V_1, k+1}$ be the Hessian approximation given by the $V$-BFGS update formula with the potential $V = V_1$, and $B^{\mathrm{DFP}}_{V_2, k+1}$ be the Hessian approximation given by the $V$-DFP update formula with the potential $V = V_2$. Then the update formula of the $(V_1, V_2)$-Broyden family is defined by

$$ B_{k+1} = \vartheta B^{\mathrm{BFGS}}_{V_1, k+1} + (1 - \vartheta) B^{\mathrm{DFP}}_{V_2, k+1} \tag{18} $$

for $\vartheta \in [0, 1]$. The $(V_1, V_2)$-Broyden family is obtained as an affine combination of $B^{\mathrm{BFGS}}[B_k; s_k, y_k]$, $B^{\mathrm{DFP}}[B_k; s_k, y_k]$ and $y_k y_k^\top / s_k^\top y_k$. The standard Broyden family is recovered by setting $V_1(z) = V_2(z) = -\log z$.

Example 4. We show the $V$-BFGS formula derived from the power potential. Let $V(z)$ be the power potential $V(z) = (1 - z^{\gamma})/\gamma$ with $\gamma < 1/n$. As shown in Example 2, we have $\nu(z) = z^{\gamma}$. Due to the equality

$$ \det\big( B^{\mathrm{BFGS}}[B_k; s_k, y_k] \big) = \det(B_k) \frac{s_k^\top y_k}{s_k^\top B_k s_k} $$

and Eq. (16), for the power potential we have

$$ \frac{\nu(\det B_{k+1})}{\nu(\det B_k)} = \left( \frac{s_k^\top y_k}{s_k^\top B_k s_k} \right)^{\rho}, \qquad \rho = \frac{\gamma}{1 - (n - 1)\gamma}. $$
V-BFGS update:

Initialization: The function $\nu(z)$ denotes $-V'(z) z$. Let $B_0 \in \mathrm{PD}(n)$ be a matrix which is an initial approximation of the Hessian matrix, and $L_0 L_0^\top = B_0$ be the Cholesky decomposition of $B_0$. Let $x_0 \in \mathbb{R}^n$ be an initial point, and set $k = 0$.

Repeat: If the stopping criterion is satisfied, go to Output.

1. Let $x_{k+1} = x_k - \alpha_k B_k^{-1} \nabla f(x_k)$, where $\alpha_k > 0$ is a step length satisfying the Wolfe condition [15, Section 3.1]. The Cholesky decomposition $B_k = L_k L_k^\top$ is available to compute $B_k^{-1} \nabla f(x_k)$.

2. Set $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$.

3. Update $L_k$ to $\bar{L}$, which is the Cholesky factor of $B^{\mathrm{BFGS}}[B_k; s_k, y_k]$, that is,
   $\bar{L} \bar{L}^\top = B^{\mathrm{BFGS}}[B_k; s_k, y_k] = B^{\mathrm{BFGS}}[L_k L_k^\top; s_k, y_k]$.
   The Cholesky decomposition with rank-one update is available.

4. Compute
   $C = \dfrac{(\det \bar{L})^2}{\nu((\det L_k)^2)^{n-1}}$
   and find the root of the equation
   $C \cdot \nu(z)^{n-1} = z, \quad z > 0$.
   Let the solution be $z^*$.

5. Compute the Cholesky factor $L_{k+1}$ such that
   $L_{k+1} L_{k+1}^\top = \dfrac{\nu(z^*)}{\nu((\det L_k)^2)} \bar{L} \bar{L}^\top + \left( 1 - \dfrac{\nu(z^*)}{\nu((\det L_k)^2)} \right) \dfrac{y_k y_k^\top}{s_k^\top y_k}$.

6. $k \leftarrow k + 1$.

Output: Local optimal solution $x_k$.

Figure 2: Pseudo code of the V-BFGS method. The Cholesky decomposition with rank-one update is useful in the algorithm.
Then the $V$-BFGS update formula is given as

$$ B_{k+1} = \left( \frac{s_k^\top y_k}{s_k^\top B_k s_k} \right)^{\rho} B^{\mathrm{BFGS}}[B_k; s_k, y_k] + \left( 1 - \left( \frac{s_k^\top y_k}{s_k^\top B_k s_k} \right)^{\rho} \right) \frac{y_k y_k^\top}{s_k^\top y_k}. $$

For $\gamma$ such that $\gamma < 1/n$, we have $-1/(n-1) < \rho < 1$. Recall that the standard self-scaling update formula corresponds to the above update with $\rho = 1$. Therefore, the standard self-scaling update formula is not derived from the power potential. Indeed, the power potential with $\rho = 1$, or equivalently $\gamma = 1/n$, is a convex function but not a strictly convex function.
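
For the power potential the update is explicit, so it can be checked against the implicit formula (12) directly. The sketch below is a hedged illustration; `power_v_bfgs` and the random test data are assumed names and values, not part of the paper.

```python
import numpy as np

def power_v_bfgs(B, s, y, gamma):
    """Explicit V-BFGS update for the power potential with parameter gamma < 1/n.

    Returns the updated matrix and the scaling coefficient
    theta = (s^T y / s^T B s)^rho with rho = gamma / (1 - (n-1) gamma).
    """
    n = B.shape[0]
    rho = gamma / (1.0 - (n - 1) * gamma)
    Bs = B @ s
    theta = ((s @ y) / (s @ Bs)) ** rho
    B_bfgs = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)
    return theta * B_bfgs + (1.0 - theta) * np.outer(y, y) / (s @ y), theta

# Hypothetical test data with s^T y > 0.
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
B_k = A @ A.T + n * np.eye(n)
s = rng.standard_normal(n)
y = np.diag(rng.uniform(0.5, 2.0, n)) @ s

gamma = 0.1                                   # satisfies gamma < 1/n = 0.2
B_next, theta = power_v_bfgs(B_k, s, y, gamma)
# Ratio nu(det B_{k+1}) / nu(det B_k) with nu(z) = z^gamma, as required by (12).
ratio = (np.linalg.det(B_next) / np.linalg.det(B_k)) ** gamma
B_std, theta_std = power_v_bfgs(B_k, s, y, 0.0)   # gamma = 0 gives rho = 0
```

The value `ratio` should coincide with `theta`, and with `gamma = 0.0` the coefficient equals one, so `B_std` is the standard BFGS matrix.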

In terms of the self-scaling update formula, we show the following proposition.

Proposition 2. There does not exist a potential function such that in Eq. (12) the equality

$$ \frac{\nu(\det B_{k+1})}{\nu(\det B_k)} = \frac{s_k^\top y_k}{s_k^\top B_k s_k} \tag{19} $$

holds for any $B_k \in \mathrm{PD}(n)$ and any $s_k, y_k \in \mathbb{R}^n$ satisfying $s_k^\top y_k > 0$.

Proof. We have two equalities,

$$ \det\big( B^{\mathrm{BFGS}}[B_k; s_k, y_k] \big) = \det(B_k) \frac{s_k^\top y_k}{s_k^\top B_k s_k}, $$

$$ \det B_{k+1} = \frac{\det\big( B^{\mathrm{BFGS}}[B_k; s_k, y_k] \big)}{\nu(\det B_k)^{n-1}}\, \nu(\det B_{k+1})^{n-1}. $$

Hence, we have

$$ \left( \frac{\nu(\det B_{k+1})}{\nu(\det B_k)} \right)^{n-1} = \frac{\det B_{k+1}}{\det B_k} \cdot \frac{s_k^\top B_k s_k}{s_k^\top y_k}. $$

Suppose that there exists a potential function satisfying (19). Then we have

$$ \left( \frac{s_k^\top y_k}{s_k^\top B_k s_k} \right)^{n-1} = \frac{\det B_{k+1}}{\det B_k} \cdot \frac{s_k^\top B_k s_k}{s_k^\top y_k}, $$

and hence the equality

$$ \det B_{k+1} = \det(B_k) \left( \frac{s_k^\top y_k}{s_k^\top B_k s_k} \right)^{n} $$

holds. Substituting the above formula into (19), we have

$$ \nu\!\left( \det(B_k) \left( \frac{s_k^\top y_k}{s_k^\top B_k s_k} \right)^{n} \right) = \nu(\det B_k) \frac{s_k^\top y_k}{s_k^\top B_k s_k}. $$

Let $B_k$ be a positive definite matrix such that $\det B_k = 1$, and let $z = \left( \frac{s_k^\top y_k}{s_k^\top B_k s_k} \right)^{n}$. Then we have $\nu(z) = \nu(1) z^{1/n}$ for $z > 0$. The corresponding $\beta_V$ is given by $\beta_V(z) = 1/n$, and this does not satisfy the definition of the potential function. □



4 Convergence Analysis 

We consider the convergence property of the $V$-BFGS method. Some standard assumptions about the objective function $f$ are stated below. See Section 6.4 of [15] for details.

Assumption 1.
1. The objective function $f$ is twice continuously differentiable.
2. Let $\nabla^2 f(x)$ be the Hessian matrix of $f$ at $x$. For the starting point $x_0$, the level set $C = \{ x \in \mathbb{R}^n \mid f(x) \le f(x_0) \}$ is convex, and there exist positive constants $m$ and $M$ such that

$$ m \|z\|^2 \le z^\top \nabla^2 f(x) z \le M \|z\|^2 \tag{20} $$

holds for all $z \in \mathbb{R}^n$ and $x \in C$.

The following theorem implies that the sequence $\{x_k\}$ generated by the $V$-BFGS update formula converges to the local minimizer of $f$ if the function $\nu_V$ of a potential $V$ satisfies the bounding condition.

Theorem 3. Let $B_0 \in \mathrm{PD}(n)$ be an initial matrix and $x_0 \in \mathbb{R}^n$ be a starting point which meets Assumption 1. Suppose that there exist positive constants $L_1, L_2 > 0$ such that $L_1 < \nu < L_2$. Then the sequence $\{x_k\}$ generated by the $V$-BFGS update converges to the minimizer $x^*$ of $f$.

Lemma 4 (Eq. 6.12 in [15]). Let $\bar{G}$ be the averaged Hessian

$$ \bar{G} = \int_0^1 \nabla^2 f(x_k + \tau s)\, d\tau, \qquad s = x_{k+1} - x_k \in \mathbb{R}^n. $$

Then the property $y = \bar{G} s$ follows from Taylor's theorem, where $y = \nabla f(x_{k+1}) - \nabla f(x_k)$.
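
The identity $y = \bar{G} s$ and the resulting bounds on $y^\top s / s^\top s$ and $\|y\|^2 / y^\top s$ can be verified numerically. The sketch below is an assumption-laden illustration: the objective $f(x) = \frac12 x^\top A x + \sum_i \log\cosh(x_i)$ is a hypothetical test function whose Hessian lies between $A$ and $A + I$, and the integral is approximated by Gauss-Legendre quadrature.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
R = rng.standard_normal((n, n))
A = R @ R.T + np.eye(n)          # positive definite quadratic part

def grad(x):
    """Gradient of the hypothetical objective 0.5 x^T A x + sum_i log cosh(x_i)."""
    return A @ x + np.tanh(x)

def hess(x):
    """Hessian of the same objective; it lies between A and A + I."""
    return A + np.diag(1.0 - np.tanh(x) ** 2)

def averaged_hessian(x, s, order=30):
    """Approximate the integral of hess(x + tau s) over tau in [0, 1]."""
    nodes, weights = np.polynomial.legendre.leggauss(order)
    taus, w = 0.5 * (nodes + 1.0), 0.5 * weights   # map [-1, 1] to [0, 1]
    return sum(wi * hess(x + ti * s) for ti, wi in zip(taus, w))

x = rng.standard_normal(n)
s = rng.standard_normal(n)
y = grad(x + s) - grad(x)
G_bar = averaged_hessian(x, s)

eig_A = np.linalg.eigvalsh(A)
m, M = eig_A[0], eig_A[-1] + 1.0   # constants of (20) for this test function
```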
Using Lemma 4 we prove Theorem 3 in a manner similar to Section 8.4

in mn. 



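Proof of Theorem 3. Let $B_k$, $k = 0, 1, 2, \ldots$ be the sequence of approximate Hessian matrices generated by the $V$-BFGS update formula. We define $\tilde{B}_{k+1}$ and $\tilde{B}_k$ by $\tilde{B}_{k+1} = \frac{1}{\nu(\det B_{k+1})} B_{k+1}$ and $\tilde{B}_k = \frac{1}{\nu(\det B_k)} B_k$, respectively. Then the update formula shown in Theorem 1 is represented as

$$ \tilde{B}_{k+1} = \tilde{B}_k - \frac{\tilde{B}_k s_k s_k^\top \tilde{B}_k}{s_k^\top \tilde{B}_k s_k} + \frac{1}{\nu(\det B_{k+1})} \frac{y_k y_k^\top}{s_k^\top y_k}. \tag{21} $$

We compute

$$ \psi(\tilde{B}_{k+1}) = \mathrm{tr}(\tilde{B}_{k+1}) - \log\det \tilde{B}_{k+1}. $$

With the averaged Hessian $\bar{G}$ of Lemma 4, the inequality (20) yields

$$ \frac{y_k^\top s_k}{s_k^\top s_k} = \frac{s_k^\top \bar{G} s_k}{s_k^\top s_k} \ge m, \tag{22} $$

$$ \frac{\|y_k\|^2}{y_k^\top s_k} = \frac{s_k^\top \bar{G}^2 s_k}{s_k^\top \bar{G} s_k} \le M. \tag{23} $$

We now define

$$ \cos\theta_k = \frac{s_k^\top \tilde{B}_k s_k}{\|s_k\| \|\tilde{B}_k s_k\|}, \qquad q_k = \frac{s_k^\top \tilde{B}_k s_k}{\|s_k\|^2}. $$

Then the trace of $\tilde{B}_{k+1}$ is bounded above. Indeed, the inequality

$$ \mathrm{tr}(\tilde{B}_{k+1}) = \mathrm{tr}(\tilde{B}_k) - \frac{\|\tilde{B}_k s_k\|^2}{s_k^\top \tilde{B}_k s_k} + \frac{\|y_k\|^2}{\nu(\det B_{k+1})\, s_k^\top y_k} \le \mathrm{tr}(\tilde{B}_k) - \frac{q_k}{\cos^2\theta_k} + \frac{M}{\nu(\det B_{k+1})} $$

holds, where (23) is used. Using the formula $\det(I + x y^\top + u v^\top) = (1 + x^\top y)(1 + u^\top v) - (x^\top v)(y^\top u)$ for $\tilde{B}_{k+1}$, we obtain a lower bound of the determinant $\det(\tilde{B}_{k+1})$ such that

$$ \det(\tilde{B}_{k+1}) = \det(\tilde{B}_k) \frac{1}{\nu(\det B_{k+1})} \frac{s_k^\top y_k}{s_k^\top \tilde{B}_k s_k} = \det(\tilde{B}_k) \frac{1}{\nu(\det B_{k+1})} \frac{s_k^\top y_k}{\|s_k\|^2} \frac{\|s_k\|^2}{s_k^\top \tilde{B}_k s_k} \ge \det(\tilde{B}_k) \frac{m}{q_k\, \nu(\det B_{k+1})}, $$

where (22) is used. These inequalities give an upper bound of $\psi(\tilde{B}_{k+1})$,

$$ \psi(\tilde{B}_{k+1}) \le \psi(\tilde{B}_k) + \left( \frac{M}{\nu(\det B_{k+1})} - \log\frac{m}{\nu(\det B_{k+1})} - 1 \right) + \left( 1 - \frac{q_k}{\cos^2\theta_k} + \log\frac{q_k}{\cos^2\theta_k} \right) + \log\cos^2\theta_k $$

$$ \le \psi(\tilde{B}_k) + \left( \frac{M}{L_1} - \log\frac{m}{L_2} - 1 \right) + \log\cos^2\theta_k. $$

The second inequality is derived from the bounds $L_1 < \nu < L_2$ and from

$$ 1 - \frac{q_k}{\cos^2\theta_k} + \log\frac{q_k}{\cos^2\theta_k} \le 0. $$

As a result we obtain

$$ 0 < \psi(\tilde{B}_{k+1}) \le \psi(\tilde{B}_0) + c(k + 1) + \sum_{j=0}^{k} \log\cos^2\theta_j, $$

where $c$ is a positive constant such that $c > \frac{M}{L_1} - \log\frac{m}{L_2} - 1$. Let us then proceed by contradiction and assume that $\cos\theta_j \to 0$. Then there exists $k_1 > 0$ such that for all $j > k_1$, we have

$$ \log\cos^2\theta_j < -2c. $$

Thus the following inequality holds for all $k > k_1$:

$$ 0 < \psi(\tilde{B}_0) + c(k + 1) + \sum_{j=0}^{k_1} \log\cos^2\theta_j + (k - k_1)(-2c) = \psi(\tilde{B}_0) + \sum_{j=0}^{k_1} \log\cos^2\theta_j + c(2k_1 + 1) - ck. $$

The right-hand side is negative for large $k$, giving a contradiction. Therefore there exists a subsequence satisfying $\cos\theta_{j_k} \ge \delta > 0$. By Zoutendijk's result$^1$ with the Wolfe condition, this implies that $\liminf_{k \to \infty} \|\nabla f(x_k)\| = 0$. The convexity of $f$ on $C$ guarantees that $x_k$ converges to the local optimal solution. □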

The potential defined in Example 3 meets the condition of Theorem 3, while the power potential $V(z) = (1 - z^{\gamma})/\gamma$ with $\nu(z) = z^{\gamma}$ does not satisfy the condition.



5 Robustness against Inexact Line Search 

Robustness against numerical errors such as round-off error is an important feature in numerical computation. In this section we study the robustness of quasi-Newton updates against numerical errors involved in the line search. There are mainly two types of quasi-Newton updates: one is the update formula for the approximate Hessian matrix, and the other is the update for the approximate inverse Hessian matrix. In the approximate inverse Hessian update, the matrix $H_k = B_k^{-1}$ is directly updated to $H_{k+1} = B_{k+1}^{-1}$ under the secant condition $H_{k+1} y_k = s_k$. We study four kinds of update formulae, that is, the $V$-BFGS and $V$-DFP methods, each for the Hessian approximation and for the inverse Hessian approximation.

$^1$ Under some conditions, $\sum_{k \ge 0} \cos^2\theta_k \|\nabla f(x_k)\|^2 < \infty$ holds. See Theorem 3.2 in [15].

Let us consider the Hessian approximation formula. Under the exact line search, the matrix $B_k$ is updated to $B_{k+1}$, which is the minimizer of $D_V(B, B_k)$ or $D_V(B^{-1}, B_k^{-1})$ subject to $B s_k = y_k$. Let

$$ x_{k+1} = x_k - \alpha_k B_k^{-1} \nabla f(x_k) = x_k + s_k $$

be the point computed by the exact line search. When the line search is inexact, the step length $\alpha_k$ will be slightly perturbed and then $s_k$ will be changed to $(1 + \epsilon) s_k$, where $\epsilon$ is an infinitesimal. The vector $y_k$ will also change to $\tilde{y}_k$ defined by

$$ \tilde{y}_k = \nabla f(x_k + (1 + \epsilon) s_k) - \nabla f(x_k) = y_k + \epsilon \nabla^2 f(x_{k+1}) s_k + O(\epsilon^2). $$

Then the constraint for the Hessian update becomes $(1 + \epsilon) B s_k = \tilde{y}_k$.

We study the relation between the perturbation of $s_k$ and the Hessian approximation $B_{k+1}$ or the inverse Hessian approximation $H_{k+1}$. Based on the above argument, we consider the optimization problems defined by

$$ \text{($V$-BFGS-B)} \quad \min_{B \in \mathrm{PD}(n)} D_V(B, B_k) \quad \text{subject to } (1 + \epsilon) B s = y + \epsilon \bar{y}, \tag{24} $$

$$ \text{($V$-DFP-B)} \quad \min_{B \in \mathrm{PD}(n)} D_V(B^{-1}, B_k^{-1}) \quad \text{subject to } (1 + \epsilon) B s = y + \epsilon \bar{y} \tag{25} $$

for a fixed matrix $B_k \in \mathrm{PD}(n)$ and fixed vectors $s, y, \bar{y} \in \mathbb{R}^n$, where the subscript $k$ for the vectors is dropped for simplicity. In the same way, the update formula for the inverse Hessian under the inexact line search is defined as the optimal solution of the following problems,

$$ \text{($V$-BFGS-H)} \quad \min_{H \in \mathrm{PD}(n)} D_V(H^{-1}, H_k^{-1}) \quad \text{subject to } H(y + \epsilon \bar{y}) = (1 + \epsilon) s, \tag{26} $$

$$ \text{($V$-DFP-H)} \quad \min_{H \in \mathrm{PD}(n)} D_V(H, H_k) \quad \text{subject to } H(y + \epsilon \bar{y}) = (1 + \epsilon) s, \tag{27} $$

for fixed $H_k \in \mathrm{PD}(n)$ and $s, y, \bar{y} \in \mathbb{R}^n$. The update formula given by $V$-BFGS-H/$V$-DFP-H directly provides the inverse of the matrix $B_{k+1}$ computed by $V$-BFGS-B/$V$-DFP-B, respectively. Theorem 1 guarantees that there exists a unique optimal solution as long as $s^\top (y + \epsilon \bar{y}) > 0$ holds. Though Theorem 1 deals only with the $V$-BFGS-B formula, we can prove the existence and the uniqueness of the optimal solution for the other problems in the same manner.

In order to study the robustness of the update formulae, we borrow concepts such as the influence function and the gross error sensitivity from robust statistics. Below the $V$-BFGS-B update formula is considered as an example. Let $B(\epsilon)$ be the optimal solution of $V$-BFGS-B in (24). Then the influence function of $B(\epsilon)$ is defined as the derivative of $B(\epsilon)$ at $\epsilon = 0$, that is,

$$ \dot{B}(0) = \lim_{\epsilon \to 0} \frac{B(\epsilon) - B(0)}{\epsilon}. $$

Later we prove the differentiability of $B(\epsilon)$. From the definition of the influence function, the optimal solution $B(\epsilon)$ is asymptotically equal to $B(0) + \epsilon \dot{B}(0)$. This implies that the inexact line search has a large impact on the computation of the Hessian approximation when the norm of $\dot{B}(0)$ is large. In the sense of the influence function, the preferable potential is the function $V$ which provides the influence function $\dot{B}(0)$ with a small norm.
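
For the standard BFGS case ($V(z) = -\log z$), the solution of (24) is the BFGS matrix built from the pair $((1 + \epsilon) s,\ y + \epsilon \bar{y})$, so the influence function can be approximated by a finite difference. The sketch below is an illustration only. The closed form in `influence_closed_form` is obtained here by differentiating formula (3) with respect to $\epsilon$; it is an assumption of this sketch and is not quoted from the paper's appendix. All names and the test data are hypothetical.

```python
import numpy as np

def bfgs_update(B, s, y):
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

def bfgs_perturbed(B, s, y, ybar, eps):
    """B(eps): BFGS matrix for the perturbed constraint (1+eps) B s = y + eps*ybar."""
    return bfgs_update(B, (1.0 + eps) * s, y + eps * ybar)

def influence_fd(B, s, y, ybar, h=1e-6):
    """Central finite-difference approximation of the derivative of B(eps) at 0."""
    return (bfgs_perturbed(B, s, y, ybar, h) - bfgs_perturbed(B, s, y, ybar, -h)) / (2.0 * h)

def influence_closed_form(s, y, ybar):
    """Derivative of (y+eps*ybar)(y+eps*ybar)^T / ((1+eps) s^T (y+eps*ybar)) at eps = 0."""
    sy = s @ y
    return ((np.outer(ybar, y) + np.outer(y, ybar)) / sy
            - np.outer(y, y) * (sy + s @ ybar) / sy**2)

# Hypothetical test data with s^T y > 0 and two different matrices B_k.
rng = np.random.default_rng(0)
n = 5
A1, A2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
B1 = A1 @ A1.T + n * np.eye(n)
B2 = 100.0 * (A2 @ A2.T + n * np.eye(n))
s = rng.standard_normal(n)
y = np.diag(rng.uniform(0.5, 2.0, n)) @ s
ybar = rng.standard_normal(n)

inf1 = influence_fd(B1, s, y, ybar)
inf2 = influence_fd(B2, s, y, ybar)
inf_exact = influence_closed_form(s, y, ybar)
```

In this sketch the finite-difference values do not depend on the choice of $B_k$, because the term of (3) that involves $B_k$ is invariant under rescaling of $s$. This is consistent with the boundedness of the influence function for the standard BFGS update that the section goes on to establish.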

For fixed vectors $s$ and $y$ such that $s^\top y > 0$, the influence function $\dot{B}(0)$ depends on the matrix $B_k$ and the vector $\bar{y}$. We consider the worst-case evaluation of the influence function in terms of $B_k$ and $\bar{y}$. The gross error sensitivity is defined as the largest norm of the influence function, that is,

$$ \text{gross error sensitivity} = \sup \{ \|\dot{B}(0)\|_F \mid B_k \in \mathcal{B} \subset \mathrm{PD}(n),\ \bar{y} \in \mathcal{Y} \subset \mathbb{R}^n \}, $$

where $\mathcal{B} \subset \mathrm{PD}(n)$ and $\mathcal{Y} \subset \mathbb{R}^n$ are appropriate subsets. In many cases, the gross error sensitivity becomes infinite if $\mathcal{B}$ or $\mathcal{Y}$ is unbounded. Our concern is to find the potential function $V$ which leads to a finite gross error sensitivity under some reasonable setup.

The influence function and the gross error sensitivity have been studied in robust statistics [11]. We use these statistical techniques to analyze the stability of numerical computation. In the statistics literature, the "statistical model" $\{B \in \mathrm{PD}(n) \mid B s_k = y_k\}$ or $\{H \in \mathrm{PD}(n) \mid H y_k = s_k\}$ is fixed, and the "observed data" $B_k$ or $H_k$ is contaminated such that $B_k + \epsilon \dot{B}(0) + O(\epsilon^2)$, while in the present analysis, the matrix $B_k = H_k^{-1}$ is fixed and the model corresponding to the secant condition is perturbed.
Table 1: Gross error sensitivity of the V-BFGS formula and the V-DFP formula for the Hessian approximation and the inverse Hessian approximation. Only the standard BFGS for the Hessian approximation has finite gross error sensitivity.

                            V-BFGS                  V-DFP
  Hessian approx.           finite only for BFGS    ∞
  inverse Hessian approx.   ∞                       ∞

The potential function minimizing the gross error sensitivity is preferable for robust computation. Below we prove that the standard BFGS update for the Hessian approximation is more robust than the other update formulae. This result agrees with empirical observations [5], [15]. Moreover, only the standard BFGS update for the Hessian approximation has finite gross error sensitivity. The theoretical results are summarized in Table 1.

In the following, the gross error sensitivity with $\mathcal{B} = \mathrm{PD}(n)$ and a bounded subset $\mathcal{Y}$ is considered. Note that the boundedness of $\mathcal{Y}$ follows from the assumption that $\|\nabla^2 f\|_F$ is bounded above over $\mathbb{R}^n$. First, we note that the influence function and the gross error sensitivity make sense for minimization of non-quadratic functions.

Lemma 5. Suppose that the objective function $f(x)$ is a convex quadratic function. Then, the influence function and the gross error sensitivity are equal to zero.

Lemma 5 is clear, since for a quadratic objective function the secant condition $B s = y$ is changed to $B (1 + \epsilon) s = (1 + \epsilon) y$ under the inexact line search. That is, the secant condition is kept unchanged, and thus $B(\epsilon) = B(0)$ holds.
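
The invariance stated in Lemma 5 is easy to observe numerically for the two standard updates. The following sketch assumes a hypothetical convex quadratic $f(x) = \frac12 x^\top A x - b^\top x$ and a deliberately large perturbation $\epsilon = 0.1$; the names and data are illustrative only.

```python
import numpy as np

def bfgs_update(B, s, y):
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

def dfp_update(B, s, y):
    Bs, sy = B @ s, s @ y
    return (B - (np.outer(Bs, y) + np.outer(y, Bs)) / sy
            + (s @ Bs) * np.outer(y, y) / sy**2 + np.outer(y, y) / sy)

rng = np.random.default_rng(2)
n = 4
R = rng.standard_normal((n, n))
A = R @ R.T + np.eye(n)                 # Hessian of the quadratic objective
b = rng.standard_normal(n)
grad = lambda x: A @ x - b              # gradient of 0.5 x^T A x - b^T x

R2 = rng.standard_normal((n, n))
B_k = R2 @ R2.T + np.eye(n)             # current Hessian approximation
x = rng.standard_normal(n)
s = rng.standard_normal(n)
eps = 0.1                               # inexact step: s is replaced by (1 + eps) s

y = grad(x + s) - grad(x)
y_eps = grad(x + (1.0 + eps) * s) - grad(x)

B_exact = bfgs_update(B_k, s, y)
B_inexact = bfgs_update(B_k, (1.0 + eps) * s, y_eps)
D_exact = dfp_update(B_k, s, y)
D_inexact = dfp_update(B_k, (1.0 + eps) * s, y_eps)
```

For this quadratic, `y_eps` equals `(1 + eps) * y`, and both updates return the same matrix with or without the perturbation.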

We prove that in general the influence function is well-defined.

Theorem 6. Suppose that $s^\top y > 0$ holds for the vectors $s$ and $y$ in the problems (24), (25), (26) and (27). Then, for small $\epsilon$, the optimal solutions of $V$-BFGS-B, $V$-DFP-B, $V$-BFGS-H and $V$-DFP-H are all uniquely determined. The optimal solutions are second-order continuously differentiable with respect to $\epsilon$ in the vicinity of $\epsilon = 0$.

The proof is deferred to Appendix B.

The gross error sensitivity of each update formula is computed in the following theorems. Proofs are deferred to Appendix C.
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Theorem 7 (gross error sensitivity of V-BFGS-B). Suppose n > 3. Let s 
and y be fixed vectors such that s T y > and y be a bounded subset in W n . 
For small e, let B{e) be the optimal solution of V-BFGS-B in (|24p . Then, 
the optimal potential function of the problem 

minmax|| J B(0)|| F subject to B t G PD(n), y G y (28) 

V B k ,y 

is given as V(z) = — log(z) up to a constant factor. In the above min-max 
problem, the function V is sought from among all potentials. 

Theorem 8 (gross error sensitivity of V-DFP-B). Suppose n > 3. Let s 
and y be fixed vectors such that s T y > and y be a bounded subset in 
M. n . Suppose that there exists an open subset included in y. Let B(e) be 
the optimal solution of V-DFP-B in (|25p . Then for any potential V, the 
equality 

sup{||5(0)|| F | B k G PD(n), y G y} = oo 

holds. 

Theorem 9 (gross error sensitivity of V-BFGS-H). Suppose $n \ge 4$. Let
$s$ and $y$ be fixed vectors such that $s^\top y > 0$, and let $\mathcal{Y}$ be a bounded subset of
$\mathbb{R}^n$. Suppose that there exists an open subset included in $\mathcal{Y}$. Let $H(\varepsilon)$ be
the optimal solution of V-BFGS-H in (26). Then, for any potential $V$, the
equality

\[ \sup\{ \|\dot{H}(0)\|_F \mid H_k \in \mathrm{PD}(n),\ \bar{y} \in \mathcal{Y} \} = \infty \]

holds.

Theorem 10 (gross error sensitivity of V-DFP-H). Suppose $n \ge 3$. Let
$s$ and $y$ be fixed vectors such that $s^\top y > 0$, and let $\mathcal{Y}$ be a bounded subset of
$\mathbb{R}^n$. Let $H(\varepsilon)$ be the optimal solution of V-DFP-H in (27). Then, for any
potential $V$, the equality

\[ \sup\{ \|\dot{H}(0)\|_F \mid H_k \in \mathrm{PD}(n),\ \bar{y} \in \mathcal{Y} \} = \infty \]

holds.

It is well known that there is a dual relation between the BFGS formula
and the DFP formula. Indeed, the V-DFP update for the inverse Hessian
approximation is derived from the V-BFGS update formula for the Hessian
approximation by replacing $B_k, s_k, y_k$ with $H_k, y_k, s_k$. For the robustness
against inexact line search, however, the dual relation is violated as shown
in Table 1. In this problem, we focus on the perturbation of the vector $s_k$
rather than that of $y_k$. This is the reason why the dual relation is violated.
Powell has shown a critical difference between BFGS and DFP for convex
quadratic objective functions [19] by considering the behaviour of the eigenvalues
of the approximate Hessian matrix. In the present paper, we exploited the gross
error sensitivity, which is meaningful for non-quadratic objective functions as
shown in Lemma 5. Our approach also provides a critical difference between
the BFGS and DFP methods.

In Section 3 we introduced the $(V_1, V_2)$-Broyden family defined by (18).
It is straightforward to prove that only the standard BFGS has finite gross
error sensitivity among the $(V_1, V_2)$-Broyden family with a fixed mixing
parameter $\vartheta \in [0,1]$.

6 Numerical Studies 

We report numerical experiments on the robustness of the quasi-Newton
update formulae V-BFGS-B, V-DFP-B, V-BFGS-H, and V-DFP-H
presented in Section 5. In particular, the update formula derived from the power
potential in Example 2 is examined.

In the first numerical study, we consider the numerical stability of the update
formulae. Let $B(\varepsilon)$ be the optimal solution of V-BFGS-B (24) or V-DFP-B
(25), and $H(\varepsilon)$ be the optimal solution of V-BFGS-H (26) or V-DFP-H (27).
For each update formula, we numerically compute the approximate influence
function $\|(B(\varepsilon)-B(0))/\varepsilon\|_F$ and $\|(H(\varepsilon)-H(0))/\varepsilon\|_F$ with small $\varepsilon$, where the
power potential $V(z) = (1 - z^{\gamma})/\gamma$ is used to derive the approximate Hessian
matrix. Recall that V-BFGS and V-DFP reduce to
the standard BFGS and DFP, respectively, when $\gamma$ is equal to zero.

In what follows, we describe the setup of the numerical studies. Let $\mathrm{diag}(a_1, \ldots, a_n)$
be the $n$ by $n$ diagonal matrix with diagonal elements $a_1, \ldots, a_n$. For
V-BFGS-B and V-DFP-B, the matrix $B_k$ is set to one of the following three
matrices:

\[ B_k = \mathrm{diag}(1,\ldots,n)/(n!)^{1/n}, \quad B_k = \mathrm{diag}(1,\ldots,n), \quad \text{or} \quad B_k = I + n^3 \cdot pp^\top, \]

where in the last one $I$ is the identity matrix and $p$ is a column unit vector
defined below. The dimension of the matrix $B_k$ is set to $n = 10$, $100$, $500$
or $1000$. The first matrix $\mathrm{diag}(1,\ldots,n)/(n!)^{1/n}$ has determinant one,
and the other two matrices have a large determinant. Below we show the
procedure for generating the vectors $s$ and $y$ and the contaminated vectors
$(1+\varepsilon)s$ and $y+\varepsilon\bar{y}$ for V-BFGS-B and V-DFP-B. In the numerical studies for
V-BFGS-H and V-DFP-H, the matrix $B_k$ is replaced with the approximate
inverse Hessian $H_k$.

1. In the case that $B_k$ is $\mathrm{diag}(1,\ldots,n)/(n!)^{1/n}$ or $\mathrm{diag}(1,\ldots,n)$, the
vectors $s$ and $y$ are both generated according to the multivariate normal
distribution with mean zero and variance-covariance matrix $10 \times I$.
If the inner product $s^\top y$ is non-positive, the sign of $y$ is flipped.
The intensity of the noise involved in the line search is determined by
$\varepsilon$, which is generated according to the uniform distribution on the
interval $[-0.2, 0.2]$. Then, the vector $\bar{y}$ is generated according
to the multivariate standard normal distribution. If the inequality
$(1+\varepsilon)s^\top(y+\varepsilon\bar{y}) > 0$ does not hold, $\varepsilon$ and $\bar{y}$ are generated again until
the vectors satisfy the positivity condition.

2. In the case that $B_k$ has the expression $I + n^3 \cdot pp^\top$,
first the vector $s$ is generated according to the multivariate normal
distribution with mean zero and variance-covariance matrix $10 \times I$,
and $y$ is defined such that $y = s$. The vector $p$ is a unit vector which is
orthogonal to $y$, that is, $p$ is a vector satisfying $p^\top y = 0$ and $\|p\| = 1$,
and $B_k$ is set to $B_k = I + n^3 \cdot pp^\top$. Then the vector $\bar{y}$ is defined as $\bar{y} = p$.
The construction of these vectors is used in the proof of Theorem 9
and Theorem 10.
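
The two generation procedures above can be sketched as follows. This is a minimal illustration rather than the code used for the reported experiments; the function names `sample_case1` and `sample_case2` and the way the unit vector `p` is drawn (a random vector projected onto the orthogonal complement of `y`) are our own assumptions.

```python
import numpy as np

def sample_case1(n, rng, noise=0.2):
    """Procedure 1: random s, y with s^T y > 0, plus a noise level eps and a vector y_bar."""
    s = rng.normal(0.0, np.sqrt(10.0), size=n)   # covariance 10 * I
    y = rng.normal(0.0, np.sqrt(10.0), size=n)
    if s @ y <= 0:
        y = -y                                   # flip the sign of y
    while True:
        eps = rng.uniform(-noise, noise)
        y_bar = rng.standard_normal(n)           # standard normal y_bar
        if (1 + eps) * (s @ (y + eps * y_bar)) > 0:
            return s, y, eps, y_bar              # positivity condition holds

def sample_case2(n, rng):
    """Procedure 2: y = s, unit vector p orthogonal to y, B_k = I + n^3 p p^T, y_bar = p."""
    s = rng.normal(0.0, np.sqrt(10.0), size=n)
    y = s.copy()
    r = rng.standard_normal(n)                   # assumed way of picking p
    p = r - (r @ y) / (y @ y) * y                # remove the component along y
    p /= np.linalg.norm(p)
    B_k = np.eye(n) + n**3 * np.outer(p, p)
    return s, y, p, B_k, p.copy()

rng = np.random.default_rng(0)
s1, y1, eps1, y_bar1 = sample_case1(10, rng)
s2, y2, p2, B2, y_bar2 = sample_case2(10, rng)
```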

The Hessian or inverse Hessian update formula is applied to $B_k$ or $H_k$ with the
randomly generated secant condition. The updated matrices $B(0)$ and $B(\varepsilon)$
are respectively computed under the constraints $Bs = y$ and $B(1+\varepsilon)s =
y + \varepsilon\bar{y}$ by using the V-BFGS-B and V-DFP-B update formulae. In the same
way, V-BFGS-H and V-DFP-H are respectively applied to compute $H(0)$
with the constraint $Hy = s$ and $H(\varepsilon)$ with the perturbed secant condition
$H(y + \varepsilon\bar{y}) = (1+\varepsilon)s$. The influence function of each update formula is
approximated by $\|(B(\varepsilon) - B(0))/\varepsilon\|_F$ or $\|(H(\varepsilon) - H(0))/\varepsilon\|_F$.
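
A minimal sketch of this finite-difference approximation, restricted to the standard BFGS and DFP updates of the Hessian approximation (the case of zero power in the power potential), is given below. The test vectors and the matrix `B` are hand-picked illustrative values, and `closed_form` is our reading of the appendix expression for standard BFGS, in which the term depending on the potential vanishes.

```python
import numpy as np

def bfgs_update(B, s, y):
    """Standard BFGS update of the Hessian approximation."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

def dfp_update(B, s, y):
    """Standard DFP update of the Hessian approximation."""
    sy = s @ y
    P = np.eye(len(s)) - np.outer(y, s) / sy
    return P @ B @ P.T + np.outer(y, y) / sy

def approx_influence(update, B, s, y, y_bar, eps=1e-6):
    """Frobenius norm of (B(eps) - B(0)) / eps for the pair ((1+eps)s, y + eps*y_bar)."""
    B0 = update(B, s, y)
    B_eps = update(B, (1 + eps) * s, y + eps * y_bar)
    return np.linalg.norm((B_eps - B0) / eps, "fro")

# Illustrative data with s^T y = 4 > 0.
s = np.array([1.0, 2.0, 0.0, -1.0])
y = np.array([2.0, 1.0, 1.0, 0.0])
y_bar = np.array([0.0, 1.0, -1.0, 2.0])
B = np.diag([1.0, 2.0, 3.0, 4.0])

sy = s @ y
closed_form = ((np.outer(y, y_bar) + np.outer(y_bar, y)) / sy
               - ((y + y_bar) @ s) / sy**2 * np.outer(y, y))

bfgs_small = approx_influence(bfgs_update, B, s, y, y_bar)
bfgs_large = approx_influence(bfgs_update, 100 * B, s, y, y_bar)
dfp_small = approx_influence(dfp_update, B, s, y, y_bar)
dfp_large = approx_influence(dfp_update, 100 * B, s, y, y_bar)
```

In this example the BFGS value does not change when `B` is rescaled, whereas the DFP value grows with the scale of `B`, in line with the unbounded gross error sensitivity stated in Theorem 8.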

Table 2 shows the average of the approximate influence function over
20 runs for each setup. When $B_k$ or $H_k$ is equal to the diagonal matrix
$\mathrm{diag}(1,\ldots,n)/(n!)^{1/n}$, we see that the power $\gamma$ of the power potential does
not significantly affect the influence function in both V-BFGS and V-DFP.
For the other setups, overall the BFGS method for the Hessian matrix, i.e.
V-BFGS-B with $\gamma = 0$, has a smaller influence function than the other update
formulae. The V-DFP-H for the inverse Hessian update also has a relatively small
influence function when $H_k$ is proportional to $\mathrm{diag}(1,\ldots,n)$. For $H_k =
I + n^3 pp^\top$, however, we find that V-DFP-H is sensitive to the noise involved
in the line search.

These numerical results agree with the theoretical analysis, as shown below:

1. Theorem 7 implies that the standard BFGS method is robust against
inexact line search.

2. As shown in Example 2, the V-BFGS-B update with the power potential is
close to the standard BFGS update for large $n$ and moderate $\det(B_k)$.
That is, the mixing parameter $(s_k^\top y_k / s_k^\top B_k s_k)^{\rho}$ in Example 2 will be
close to one if $n$ is large and $s_k^\top y_k / s_k^\top B_k s_k$ does not depend strongly on the
dimension $n$. When $B_k$ has a large determinant which grows
with the dimension $n$, the value of $s_k^\top y_k / s_k^\top B_k s_k$ will depend heavily
on the dimension $n$. Hence, the mixing parameter $(s_k^\top y_k / s_k^\top B_k s_k)^{\rho}$ will
not be close to one even for large $n$, and in such a case the influence
function is affected by the choice of the power $\gamma$. The same argument
on the relation between the influence function and the power $\gamma$ holds
for the inverse Hessian updates, that is, V-BFGS-H and V-DFP-H.

3. For $B_k = I + n^3 pp^\top$ the results for V-BFGS-B and V-DFP-B are
numerically the same. Under this setup, we can theoretically confirm that
the influence functions of both update formulae are identical to each
other. On the other hand, some calculation shows that the influence
functions of V-BFGS-H and V-DFP-H are not the same.

The standard BFGS update formula achieves the min-max optimality of
the gross error sensitivity. That is, the BFGS method is not necessarily
optimal for each individual setup. In the numerical studies, however, the BFGS method
uniformly provides a fairly stable update formula compared to the other methods.

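Next, we apply the standard BFGS-B and DFP-B to solve the following
two optimization problems: the quadratic convex problem

\[ \text{(Problem 1)} \qquad \min_{x \in \mathbb{R}^n} f(x) = \frac{1}{2} x^\top A x - e^\top x, \]

where $e = (1, \ldots, 1)^\top \in \mathbb{R}^n$ and

\[ A = \begin{pmatrix} 2 & -1 & & \\ -1 & 2 & \ddots & \\ & \ddots & \ddots & -1 \\ & & -1 & 2 \end{pmatrix} \in \mathbb{R}^{n \times n}, \]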
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x Ax — ex 



(n + 1) 2 



1 




i=l 



where the vector $e$ and the matrix $A$ are the same as in Problem 1. The
objective function in Problem 2 is non-linear and non-convex. The initial
point $x_0$ is randomly generated by the $n$-dimensional normal distribution with
mean zero and variance-covariance matrix $10 \times I$. The termination criterion



is employed, which is the same criterion used by Yamashita [20]. Although
the second criterion above means that the method fails to obtain a solution,
no trial reached the maximum number of iterations. In each problem,
the step length $\alpha_k$ is computed by the MATLAB command "fminbnd" with the
option TolX = $10^{-12}$, which denotes the termination tolerance on $x$. In the
same way as in the numerical studies on the robustness of update formulae, the
vector $s_k = x_{k+1} - x_k$ is randomly perturbed to $\tilde{s}_k = (1+\varepsilon)s_k$, where
$\varepsilon$ is a random variable following the uniform distribution on the interval
$[-h, h]$. The value of $h$ varies from 0 to 0.3. Accordingly, the vector
$y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$ is also changed to $\tilde{y}_k = \nabla f(x_k + \tilde{s}_k) - \nabla f(x_k)$. As
a result, for each iteration the secant condition with inexact line search is
given as $B\tilde{s}_k = \tilde{y}_k$.
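
The contaminated iteration can be sketched as follows for a convex quadratic with the tridiagonal matrix $A$ of Problem 1. This is an illustration under stated assumptions, not the experimental code: the objective is taken as one half of the quadratic form minus the linear term, the exact minimizer along the search direction replaces the MATLAB line search, and the tolerance `tol`, the dimension and the noise level `h` are hypothetical choices.

```python
import numpy as np

def tridiag(n):
    """Tridiagonal matrix with 2 on the diagonal and -1 on the off-diagonals."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def run_bfgs(n=20, h=0.1, tol=1e-6, max_iter=5000, seed=0):
    """BFGS on an assumed quadratic f(x) = 0.5 x^T A x - e^T x with a perturbed step."""
    rng = np.random.default_rng(seed)
    A, e = tridiag(n), np.ones(n)
    grad = lambda x: A @ x - e
    x = rng.normal(0.0, np.sqrt(10.0), size=n)   # initial point, covariance 10 * I
    B = np.eye(n)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:              # hypothetical stopping rule
            return k, np.linalg.norm(g)
        d = -np.linalg.solve(B, g)               # quasi-Newton direction
        alpha = -(g @ d) / (d @ A @ d)           # exact step for a quadratic
        eps = rng.uniform(-h, h)
        s = (1 + eps) * alpha * d                # contaminated step
        y = grad(x + s) - g                      # gradient difference at the new point
        x = x + s
        Bs = B @ s
        B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)
    return max_iter, np.linalg.norm(grad(x))

iters, gnorm = run_bfgs()
```

Replacing the last update line by the DFP formula gives the corresponding DFP run; the iteration counts reported in the paper come from the authors' own setup, not from this sketch.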

The average number of iterations over 20 runs for BFGS and DFP is
shown in Table 3. Compared to the DFP method, the BFGS method requires fewer
iterations to reach the optimal solution. Moreover, for the BFGS
update the number of iterations is stable with respect to $h$. This result
implies that BFGS is robust against random noise involved in inexact line
search. On the other hand, the behaviour of the DFP method is sensitive to
the contaminated step length. Indeed, the number of iterations of the DFP method
rises drastically with the intensity of the noise. For the quadratic convex
objective function, the inexact line search does not affect the secant
condition. Hence the numerical result implies that the quality of the descent
direction $-B_k^{-1} \nabla f(x_k)$ in DFP is easily degraded by inexact line search.
These numerical properties of quasi-Newton methods are empirically
well known [5], [15]. Powell [19] has theoretically studied the progression of
eigenvalues in approximate Hessian matrices in order to illustrate the
difference between BFGS and DFP.

Through the numerical studies in this section, we found that the
theoretical framework exploiting robust statistics can be a useful tool to investigate
the properties of quasi-Newton methods.



||V/(x fc )|| <nx 10 



or k > 50000

7 Concluding Remarks



Along the line of research started by Fletcher [7], we considered the
quasi-Newton update formula based on the Bregman divergence induced
from potential functions. The proposed update formulae for the Hessian
approximation belong to the class of self-scaling quasi-Newton methods. We
studied the convergence property. Then, we applied the tools of robust
statistics to analyze the robustness of the Hessian update formulae. As a
result, we found that the influence of the inexact line search is bounded only
for the standard BFGS formula for the Hessian approximation. Numerical
studies support the usefulness of the theoretical framework borrowed from
robust statistics.

It will be interesting future work to investigate the practical
advantage of the self-scaling quasi-Newton methods derived from the V-Bregman
divergence. Nocedal and Yuan proved that the self-scaling quasi-Newton
method with the popular scaling parameter (15) has some drawbacks [16].
In our framework, the self-scaling quasi-Newton method with the scaling
parameter (15) is outside the formulae derived from the V-Bregman divergence.
More precisely, the function $V(z) = n(1 - z^{1/n})$, which is not a potential,
formally leads to the popular self-scaling quasi-Newton formula. For the
corresponding Bregman divergence $D_V(P, Q)$, the equality $D_V(P, cP) = 0$ holds
for any $P \in \mathrm{PD}(n)$ and any $c > 0$. This property implies that the scale of the
Hessian approximation is not fixed. We think that this property may lead to
some inefficiency of the self-scaling quasi-Newton method with (15). The
self-scaling quasi-Newton method associated with the V-Bregman divergence
may perform well in practice.
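
The scale-invariance property mentioned above can be checked numerically. The sketch below assumes that the divergence has the form $D_V(P,Q) = V(\det P) - V(\det Q) + \nu(\det Q)(\mathrm{tr}(PQ^{-1}) - n)$ with $\nu(z) = -zV'(z)$, which is our reading of the definition and is consistent with the gradient used in the optimality condition of Appendix A; the test matrix and the factor `c` are illustrative.

```python
import numpy as np

def bregman_div(V, nu, P, Q):
    """Assumed form: V(det P) - V(det Q) + nu(det Q) * (tr(P Q^{-1}) - n)."""
    n = P.shape[0]
    dP, dQ = np.linalg.det(P), np.linalg.det(Q)
    return V(dP) - V(dQ) + nu(dQ) * (np.trace(P @ np.linalg.inv(Q)) - n)

n = 4
rng = np.random.default_rng(0)
M = rng.standard_normal((n, n))
P = M @ M.T + n * np.eye(n)       # illustrative positive definite matrix
c = 3.0

# V(z) = n (1 - z^(1/n)), for which nu(z) = z^(1/n).
d_scale = bregman_div(lambda z: n * (1 - z ** (1 / n)), lambda z: z ** (1 / n), P, c * P)

# V(z) = -log z, for which nu(z) = 1.
d_log = bregman_div(lambda z: -np.log(z), lambda z: 1.0, P, c * P)
```

Under this assumed form, `d_scale` vanishes up to rounding while `d_log` equals $n\log c + n/c - n > 0$, so only the logarithmic potential penalizes a pure rescaling of the matrix.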

Another research direction is to consider the choice of the potential
function $V$. Under the criterion of the gross error sensitivity, we found that the
negative logarithmic function $V(z) = -\log z$ is the optimal choice.
Other criteria may lead to other optimal potentials. Investigating the relation
between the criterion for the update formula and the optimal potential will
be beneficial for the design of numerical algorithms.

A Proof of Theorem 1

We prove the following lemma which is useful to show the existence of the 
optimal solution. 

Lemma 11. Let $V$ be a potential and $\nu = \nu_V$. For any $C > 0$ the equation

\[ C \nu(z)^{n-1} = z, \quad z > 0 \tag{29} \]

has a unique solution.

Proof. We define the function $\zeta(z)$ by $\zeta(z) = \log z - (n-1)\log\nu(z)$; then
(29) is equivalent to the equation

\[ \log C = \zeta(z), \quad z > 0. \tag{30} \]

Since the potential function satisfies $\lim_{z \to +0} z/\nu(z)^{n-1} = 0$ from the
definition, we have $\lim_{z \to +0} \zeta(z) = -\infty$. In terms of the derivative of $\zeta(z)$, we
have the following inequality:

\[ \frac{d}{dz}\zeta(z) = \frac{1}{z} - (n-1)\frac{\nu'(z)}{\nu(z)} > \frac{1}{zn} > 0. \]

Thus, $\zeta(z)$ is an increasing function on $\mathbb{R}_+$. Moreover, for $z \ge 1$ we have

\[ \zeta(z) \ge \zeta(1) + \int_1^z \frac{1}{zn}\, dz = \zeta(1) + \frac{\log z}{n}. \]

The above inequality implies that $\lim_{z \to \infty} \zeta(z) = \infty$. Since $\zeta(z)$ is
continuous, the equation (30) has a unique solution. □
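
As an illustration of the lemma, the following sketch solves the equation by bisection for the power potential. It assumes $\nu(z) = z^{\gamma}$, which is what $\nu(z) = -zV'(z)$ gives for $V(z) = (1 - z^{\gamma})/\gamma$; the function name `solve_eq29` and the values of `C`, `n` and `gamma` are hypothetical.

```python
import math

def solve_eq29(C, n, gamma, lo=-50.0, hi=50.0, iters=200):
    """Solve C * nu(z)**(n-1) = z with nu(z) = z**gamma by bisection on t = log z."""
    target = math.log(C)
    # zeta(z) = log z - (n-1) log nu(z), written as a function of t = log z.
    zeta = lambda t: t - (n - 1) * gamma * t
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if zeta(mid) < target:    # zeta is increasing when 1 - (n-1)*gamma > 0
            lo = mid
        else:
            hi = mid
    return math.exp(0.5 * (lo + hi))

C, n, gamma = 2.0, 5, -1.0
z_star = solve_eq29(C, n, gamma)
closed_form = C ** (1.0 / (1.0 - (n - 1) * gamma))   # explicit root for this nu
residual = C * (z_star ** gamma) ** (n - 1) - z_star
```

For this choice of $\nu$ the root is available in closed form, so the bisection only serves to mirror the monotonicity argument of the proof.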



Proof of Theorem 1. First, we show the existence of the matrix $B_{k+1}$
satisfying (12). Lemma 11 shows that there exists a solution $z^* > 0$ for the
equation

\[ \frac{\det(B^{BFGS}[B_k; s_k, y_k])}{\nu(\det B_k)^{n-1}}\, \nu(z)^{n-1} = z, \quad z > 0. \]

By using the solution $z^*$, we define the matrix $\bar{B}$ such that

\[ \bar{B} = \frac{\nu(z^*)}{\nu(\det B_k)} B^{BFGS}[B_k; s_k, y_k] + \Big(1 - \frac{\nu(z^*)}{\nu(\det B_k)}\Big) \frac{y_k y_k^\top}{s_k^\top y_k}; \]

then the determinant of $\bar{B}$ satisfies

\[ \det \bar{B} = \Big(\frac{\nu(z^*)}{\nu(\det B_k)}\Big)^{n-1} \det(B^{BFGS}[B_k; s_k, y_k]) = z^*, \]

in which the first equality comes from the formula $\det(A + vu^\top) = \det(A)(1 +
u^\top A^{-1} v)$ and the second one follows from the definition of $z^*$. Hence there exists
$B_{k+1} \in \mathrm{PD}(n)$ satisfying (12).
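
The rank-one determinant formula used here, and its consequence for a convex-type combination of the BFGS matrix with $y_k y_k^\top / s_k^\top y_k$, can be checked numerically as follows. The weight `c` stands in for the ratio of the two $\nu$ values and, like all matrices and vectors below, is an illustrative assumption.

```python
import numpy as np

def bfgs_update(B, s, y):
    """Standard BFGS update of the Hessian approximation."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

rng = np.random.default_rng(1)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)              # illustrative positive definite matrix
u, v = rng.standard_normal(n), rng.standard_normal(n)

# det(A + v u^T) = det(A) * (1 + u^T A^{-1} v)
lhs = np.linalg.det(A + np.outer(v, u))
rhs = np.linalg.det(A) * (1.0 + u @ np.linalg.solve(A, v))

# Application: B_bar = c * B_bfgs + (1 - c) * y y^T / (s^T y) has det c^(n-1) * det(B_bfgs).
s = rng.standard_normal(n)
y = A @ s                                 # guarantees s^T y > 0
B_bfgs = bfgs_update(np.eye(n), s, y)
c = 0.7                                   # hypothetical value of the nu ratio
B_bar = c * B_bfgs + (1 - c) * np.outer(y, y) / (s @ y)
det_bar = np.linalg.det(B_bar)
det_expected = c ** (n - 1) * np.linalg.det(B_bfgs)
```

The second identity relies on $B^{BFGS} s = y$, which gives $y^\top (B^{BFGS})^{-1} y = s^\top y$.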

Next, we show that the matrix $B_{k+1}$ in (12) satisfies the optimality
condition of (10). According to Güler et al. [10], the normal vector for the
affine subspace

\[ \mathcal{M} = \{ B \in \mathrm{PD}(n) \mid B s_k = y_k \} \]

is characterized by the form of

\[ s_k \lambda^\top + \lambda s_k^\top \in \mathrm{Sym}(n), \quad \lambda \in \mathbb{R}^n. \tag{31} \]

In fact, for $B_1, B_2 \in \mathcal{M}$ we have

\[ \langle s_k \lambda^\top + \lambda s_k^\top,\, B_1 - B_2 \rangle = \lambda^\top B_1 s_k + s_k^\top B_1 \lambda - \lambda^\top B_2 s_k - s_k^\top B_2 \lambda = \lambda^\top y_k + y_k^\top \lambda - \lambda^\top y_k - y_k^\top \lambda = 0, \]

and thus $s_k \lambda^\top + \lambda s_k^\top$ is a normal vector of $\mathcal{M}$. Güler et al. [10] have shown
that the normal vector is restricted to the form of (31).
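
The orthogonality computed above is easy to confirm numerically. In the sketch below, two different matrices satisfying the same secant condition are produced by applying the standard BFGS update to two different starting matrices; all data are illustrative assumptions.

```python
import numpy as np

def bfgs_update(B, s, y):
    """Standard BFGS update; the result satisfies B_new @ s == y."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

rng = np.random.default_rng(2)
n = 6
M = rng.standard_normal((n, n))
C = M @ M.T + n * np.eye(n)               # illustrative positive definite matrix
s = rng.standard_normal(n)
y = C @ s                                 # ensures s^T y > 0

B1 = bfgs_update(np.eye(n), s, y)         # two members of the affine set {B : B s = y}
B2 = bfgs_update(C @ C, s, y)

lam = rng.standard_normal(n)
normal = np.outer(s, lam) + np.outer(lam, s)
inner = np.sum(normal * (B1 - B2))        # trace inner product of symmetric matrices
```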

Suppose that $B' \in \mathrm{PD}(n)$ is an optimal solution of (10); then $B'$ satisfies the
optimality condition that there exists a vector $\lambda \in \mathbb{R}^n$ such that

\[ \nabla_B D_V(B, B_k)\big|_{B=B'} = -\nu(\det B')(B')^{-1} + \nu(\det B_k) B_k^{-1} = s_k \lambda^\top + \lambda s_k^\top, \]

where $\nabla_B D_V(B, B_k)$ denotes the gradient of $D_V(B, B_k)$ with respect to
the variable $B$. Also, the optimal solution $B'$ should satisfy the constraint
$B' s_k = y_k$. On the other hand, the matrix $B_{k+1}$ defined by (12) satisfies

\[ B_{k+1}^{-1} = \frac{\nu(\det B_k)}{\nu(\det B_{k+1})} \big(B^{BFGS}[B_k; s_k, y_k]\big)^{-1} + \Big(1 - \frac{\nu(\det B_k)}{\nu(\det B_{k+1})}\Big) \frac{s_k s_k^\top}{s_k^\top y_k} = \frac{\nu(\det B_k)}{\nu(\det B_{k+1})} B^{DFP}[B_k^{-1}; y_k, s_k] + \Big(1 - \frac{\nu(\det B_k)}{\nu(\det B_{k+1})}\Big) \frac{s_k s_k^\top}{s_k^\top y_k}, \]

and hence

\[ -\nu(\det B_{k+1}) B_{k+1}^{-1} + \nu(\det B_k) B_k^{-1} = s_k \lambda^\top + \lambda s_k^\top, \qquad \lambda = \frac{\nu(\det B_k)}{s_k^\top y_k} B_k^{-1} y_k - \frac{1}{2 s_k^\top y_k}\Big( \nu(\det B_{k+1}) + \nu(\det B_k) \frac{y_k^\top B_k^{-1} y_k}{s_k^\top y_k} \Big) s_k. \]

The conditions $s_k^\top y_k > 0$ and $B_k \in \mathrm{PD}(n)$ guarantee the existence of the
above vector $\lambda$. In addition, direct computation shows that the
constraint $B_{k+1} s_k = y_k$ is satisfied. Hence, $B_{k+1}$ satisfies the optimality
condition. Since (10) is a strictly convex problem, $B_{k+1}$ is the unique optimal
solution. □

B Proof of Theorem 6



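We show that the optimal solution of V-BFGS-B is second-order
continuously differentiable. The same proof works for the other update formulae.

Proof. We consider the problem (24). Since the inequality $s^\top(y + \varepsilon\bar{y}) > 0$
holds for infinitesimal $\varepsilon$, Theorem 1 guarantees that there exists a unique
optimal solution $B(\varepsilon)$ around $\varepsilon = 0$. Let the function $F : \mathbb{R}^{n \times n} \times \mathbb{R} \to \mathbb{R}^{n \times n}$
be

\[ F(X, \varepsilon) = \frac{1}{\nu(\det X)} X - \frac{1}{\nu(\det B_k)} B^{BFGS}[B_k; (1+\varepsilon)s, y + \varepsilon\bar{y}] - \Big( \frac{1}{\nu(\det X)} - \frac{1}{\nu(\det B_k)} \Big) \frac{(y + \varepsilon\bar{y})(y + \varepsilon\bar{y})^\top}{(1+\varepsilon) s^\top (y + \varepsilon\bar{y})} \]

for $X \in \mathbb{R}^{n \times n}$ and $\varepsilon \in \mathbb{R}$. For infinitesimal $\varepsilon$, the equality $F(B(\varepsilon), \varepsilon) = O$
holds, where $O$ is the null matrix. We apply the implicit function theorem
to prove the differentiability of $B(\varepsilon)$. Since the potential function is third-order
continuously differentiable, clearly $F(X, \varepsilon)$ is second-order
continuously differentiable in a vicinity of $(X, \varepsilon) = (B(0), 0)$. For any symmetric
matrix $A \in \mathrm{Sym}(n)$, the equality

\[ \langle \nabla_X F(X, \varepsilon), A \rangle \big|_{X=B(0),\, \varepsilon=0} = \frac{1}{\nu(\det B(0))} A - \frac{\nu'(\det B(0)) \det(B(0)) \langle A, B(0)^{-1} \rangle}{\nu(\det B(0))^2} \Big( B(0) - \frac{y y^\top}{s^\top y} \Big) \]

holds, where $\nabla_X$ denotes the gradient with respect to the variable $X$. This
implies that the gradient of $F(X, \varepsilon)$ does not vanish at $(X, \varepsilon) = (B(0), 0)$.
Hence, the implicit function theorem for $F(X, \varepsilon)$ guarantees that $B(\varepsilon)$ is
a second-order continuously differentiable function with respect to $\varepsilon$ in a
vicinity of $\varepsilon = 0$. □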



C Computations of Gross Error Sensitivity 

First, a universal formula for the computation of the influence function is proved,
and some useful lemmas are prepared. Then, the gross error sensitivity for
each update formula is computed in Sections C.1, C.2, C.3 and C.4.

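Lemma 12. Let $s$, $\bar{s}$, $y$ and $\bar{y}$ be column vectors in $\mathbb{R}^n$ such that $s^\top y > 0$,
and $B_k$ be a positive definite matrix. For an infinitesimal $\varepsilon$ let $B(\varepsilon)$ be the
optimal solution of

\[ \min_{B \in \mathrm{PD}(n)} D_V(B, B_k) \quad \text{subject to } B(s + \varepsilon\bar{s}) = y + \varepsilon\bar{y}, \tag{32} \]

and let $\Lambda[B_k; s, \bar{s}, y, \bar{y}]$ be the influence function $\dot{B}(0)$. Then we have

\[ \begin{aligned} \dot{B}(0) &= \Lambda[B_k; s, \bar{s}, y, \bar{y}] \\ &= \Big\{ \frac{s^\top \bar{y} - \bar{s}^\top y}{s^\top y} + \frac{\nu(\det B(0))}{\nu(\det B_k)} \Big( \frac{2 \bar{s}^\top B_k s \cdot s^\top B_k (B(0))^{-1} B_k s}{(s^\top B_k s)^2} - \frac{2 \bar{s}^\top B_k (B(0))^{-1} B_k s}{s^\top B_k s} \Big) \Big\} \frac{\beta(\det B(0))}{1 - (n-1)\beta(\det B(0))} \Big( B(0) - \frac{y y^\top}{s^\top y} \Big) \\ &\quad + \frac{y \bar{y}^\top + \bar{y} y^\top}{s^\top y} - \frac{\bar{s}^\top y + s^\top \bar{y}}{(s^\top y)^2} y y^\top + \frac{\nu(\det B(0))}{\nu(\det B_k)} \Big[ \frac{2 \bar{s}^\top B_k s}{(s^\top B_k s)^2} B_k s s^\top B_k - \frac{B_k (s \bar{s}^\top + \bar{s} s^\top) B_k}{s^\top B_k s} \Big]. \end{aligned} \tag{33} \]

The matrix $\Lambda[B_k; s, \bar{s}, y, \bar{y}]$ is well-defined, since the inequalities $\nu > 0$
and $1 - (n-1)\beta > 0$ hold for any potential function. Note that $\Lambda[B_k; s, s, y, y] =
O$ holds. This is another proof of Lemma 5.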

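Proof of Lemma 12. In the same way as the proofs of Theorem 1 and
Theorem 6, we can prove the existence and the differentiability of $B(\varepsilon)$. Since
$B(\varepsilon)$ is second-order continuously differentiable around $\varepsilon = 0$, the equality

\[ B(\varepsilon) = B(0) + \varepsilon\Delta + O(\varepsilon^2) \]

holds, where $\Delta \in \mathrm{Sym}(n)$. Then we have

\[ \det(B(\varepsilon)) = \det(B(0) + \varepsilon\Delta + O(\varepsilon^2)) = \det(B(0)) + \varepsilon \det(B(0)) \langle \Delta, B(0)^{-1} \rangle + O(\varepsilon^2) \]

and thus we obtain

\[ \nu(\det B(\varepsilon)) = \nu(\det B(0)) + \varepsilon\, \nu'(\det B(0)) \det(B(0)) \langle \Delta, B(0)^{-1} \rangle + O(\varepsilon^2). \]

For simplicity let $\delta$ be

\[ \delta = \det(B(0)) \langle \Delta, B(0)^{-1} \rangle; \tag{34} \]

then the equality

\[ \nu(\det B(\varepsilon)) = \nu(\det B(0)) + \varepsilon \cdot \delta \cdot \nu'(\det B(0)) + O(\varepsilon^2) \tag{35} \]

holds. By some calculation, we see that the asymptotic expansions of
$B^{BFGS}[B_k; s + \varepsilon\bar{s}, y + \varepsilon\bar{y}]$ and $(y + \varepsilon\bar{y})(y + \varepsilon\bar{y})^\top / (s + \varepsilon\bar{s})^\top (y + \varepsilon\bar{y})$ are respectively given by

\[ \begin{aligned} B^{BFGS}[B_k; s + \varepsilon\bar{s}, y + \varepsilon\bar{y}] &= B^{BFGS}[B_k; s, y] \\ &\quad + \varepsilon \Big( \frac{y\bar{y}^\top + \bar{y}y^\top}{s^\top y} - \frac{\bar{s}^\top y + s^\top \bar{y}}{(s^\top y)^2} y y^\top - \frac{B_k (s\bar{s}^\top + \bar{s}s^\top) B_k}{s^\top B_k s} + \frac{2 \bar{s}^\top B_k s}{(s^\top B_k s)^2} B_k s s^\top B_k \Big) + O(\varepsilon^2) \end{aligned} \tag{36} \]

and

\[ \frac{(y + \varepsilon\bar{y})(y + \varepsilon\bar{y})^\top}{(s + \varepsilon\bar{s})^\top (y + \varepsilon\bar{y})} = \frac{y y^\top}{s^\top y} + \varepsilon \Big( \frac{y\bar{y}^\top + \bar{y}y^\top}{s^\top y} - \frac{\bar{s}^\top y + s^\top \bar{y}}{(s^\top y)^2} y y^\top \Big) + O(\varepsilon^2). \tag{37} \]

Substituting (35), (36) and (37) into the equality

\[ B(\varepsilon) = \frac{\nu(\det B(\varepsilon))}{\nu(\det B_k)} B^{BFGS}[B_k; s + \varepsilon\bar{s}, y + \varepsilon\bar{y}] + \Big( 1 - \frac{\nu(\det B(\varepsilon))}{\nu(\det B_k)} \Big) \frac{(y + \varepsilon\bar{y})(y + \varepsilon\bar{y})^\top}{(s + \varepsilon\bar{s})^\top (y + \varepsilon\bar{y})}, \]

we obtain

\[ \begin{aligned} B(\varepsilon) &= B(0) + \varepsilon \Big\{ \delta\, \frac{\nu'(\det B(0))}{\nu(\det B_k)} \Big( B^{BFGS}[B_k; s, y] - \frac{y y^\top}{s^\top y} \Big) + \frac{y\bar{y}^\top + \bar{y}y^\top}{s^\top y} - \frac{\bar{s}^\top y + s^\top \bar{y}}{(s^\top y)^2} y y^\top \\ &\qquad - \frac{\nu(\det B(0))}{\nu(\det B_k)} \frac{B_k (s\bar{s}^\top + \bar{s}s^\top) B_k}{s^\top B_k s} + \frac{\nu(\det B(0))}{\nu(\det B_k)} \frac{2 \bar{s}^\top B_k s}{(s^\top B_k s)^2} B_k s s^\top B_k \Big\} + O(\varepsilon^2), \end{aligned} \]

and thus $\Delta$ is represented as

\[ \begin{aligned} \Delta &= \delta\, \frac{\nu'(\det B(0))}{\nu(\det B_k)} \Big( B^{BFGS}[B_k; s, y] - \frac{y y^\top}{s^\top y} \Big) + \frac{y\bar{y}^\top + \bar{y}y^\top}{s^\top y} - \frac{\bar{s}^\top y + s^\top \bar{y}}{(s^\top y)^2} y y^\top \\ &\quad - \frac{\nu(\det B(0))}{\nu(\det B_k)} \frac{B_k (s\bar{s}^\top + \bar{s}s^\top) B_k}{s^\top B_k s} + \frac{\nu(\det B(0))}{\nu(\det B_k)} \frac{2 \bar{s}^\top B_k s}{(s^\top B_k s)^2} B_k s s^\top B_k \\ &= \delta\, \frac{\nu'(\det B(0))}{\nu(\det B(0))} \Big( B(0) - \frac{y y^\top}{s^\top y} \Big) + \frac{y\bar{y}^\top + \bar{y}y^\top}{s^\top y} - \frac{\bar{s}^\top y + s^\top \bar{y}}{(s^\top y)^2} y y^\top \\ &\quad - \frac{\nu(\det B(0))}{\nu(\det B_k)} \frac{B_k (s\bar{s}^\top + \bar{s}s^\top) B_k}{s^\top B_k s} + \frac{\nu(\det B(0))}{\nu(\det B_k)} \frac{2 \bar{s}^\top B_k s}{(s^\top B_k s)^2} B_k s s^\top B_k, \end{aligned} \]

in which we use the equality

\[ \frac{\nu(\det B(0))}{\nu(\det B_k)} \Big( B^{BFGS}[B_k; s, y] - \frac{y y^\top}{s^\top y} \Big) = B(0) - \frac{y y^\top}{s^\top y}. \]

Substituting the above $\Delta$ into (34), and using $B(0)s = y$ together with $\det(B(0))\, \nu'(\det B(0)) / \nu(\det B(0)) = \beta(\det B(0))$, we have

\[ \delta = \frac{\det B(0)}{1 - \beta(\det B(0))(n-1)} \Big\{ \frac{s^\top \bar{y} - \bar{s}^\top y}{s^\top y} + \frac{\nu(\det B(0))}{\nu(\det B_k)} \Big( \frac{2 \bar{s}^\top B_k s \cdot s^\top B_k (B(0))^{-1} B_k s}{(s^\top B_k s)^2} - \frac{2 \bar{s}^\top B_k (B(0))^{-1} B_k s}{s^\top B_k s} \Big) \Big\}. \]

As a result, we obtain

\[ \begin{aligned} \frac{B(\varepsilon) - B(0)}{\varepsilon} &= \Big\{ \frac{s^\top \bar{y} - \bar{s}^\top y}{s^\top y} + \frac{\nu(\det B(0))}{\nu(\det B_k)} \Big( \frac{2 \bar{s}^\top B_k s \cdot s^\top B_k (B(0))^{-1} B_k s}{(s^\top B_k s)^2} - \frac{2 \bar{s}^\top B_k (B(0))^{-1} B_k s}{s^\top B_k s} \Big) \Big\} \frac{\beta(\det B(0))}{1 - (n-1)\beta(\det B(0))} \Big( B(0) - \frac{y y^\top}{s^\top y} \Big) \\ &\quad + \frac{y\bar{y}^\top + \bar{y}y^\top}{s^\top y} - \frac{\bar{s}^\top y + s^\top \bar{y}}{(s^\top y)^2} y y^\top + \frac{\nu(\det B(0))}{\nu(\det B_k)} \Big[ \frac{2 \bar{s}^\top B_k s}{(s^\top B_k s)^2} B_k s s^\top B_k - \frac{B_k (s\bar{s}^\top + \bar{s}s^\top) B_k}{s^\top B_k s} \Big] + O(\varepsilon). \end{aligned} \]

Letting $\varepsilon$ tend to zero, we obtain the influence function $\dot{B}(0) = \Lambda[B_k; s, \bar{s}, y, \bar{y}]$.

□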

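Lemma 13. Let $s$, $\bar{s}$, $y$ and $\bar{y}$ be a set of column vectors in $\mathbb{R}^n$ such that
$s^\top y > 0$ and $B_k$ be a matrix in $\mathrm{PD}(n)$. For an infinitesimal $\varepsilon$ let $B(\varepsilon)$ be
the optimal solution of

\[ \min_{B \in \mathrm{PD}(n)} D_V(B^{-1}, B_k^{-1}) \quad \text{subject to } B(s + \varepsilon\bar{s}) = y + \varepsilon\bar{y} \]

and let $\Gamma[B_k; s, \bar{s}, y, \bar{y}]$ be $\dot{B}(0)$; then we have

\[ \Gamma[B_k; s, \bar{s}, y, \bar{y}] = -B(0) \Lambda[B_k^{-1}; y, \bar{y}, s, \bar{s}] B(0), \tag{38} \]

where $\Lambda$ is the function defined in Lemma 12.

Proof. Let $H(\varepsilon)$ be the optimal solution of

\[ \min_{H \in \mathrm{PD}(n)} D_V(H, B_k^{-1}) \quad \text{subject to } H(y + \varepsilon\bar{y}) = s + \varepsilon\bar{s}; \]

then, clearly $B(\varepsilon) = H(\varepsilon)^{-1}$ holds. Thus we have

\[ \Gamma[B_k; s, \bar{s}, y, \bar{y}] = \dot{B}(0) = -H(0)^{-1} \dot{H}(0) H(0)^{-1} = -B(0) \Lambda[B_k^{-1}; y, \bar{y}, s, \bar{s}] B(0), \]

where $\dot{H}(0) = \Lambda[B_k^{-1}; y, \bar{y}, s, \bar{s}]$ is applied. □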
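
The only analytic ingredient of this proof is the derivative of a matrix inverse, which the following sketch checks by finite differences. The matrices `H0` and `Delta` are illustrative, and the path is taken as a straight line `H0 + eps * Delta` as an assumed stand-in for the optimal-solution path of the lemma.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
M = rng.standard_normal((n, n))
H0 = M @ M.T + n * np.eye(n)          # illustrative positive definite H(0)
D = rng.standard_normal((n, n))
Delta = 0.5 * (D + D.T)               # illustrative symmetric derivative of H at 0

eps = 1e-7
B0 = np.linalg.inv(H0)
B_eps = np.linalg.inv(H0 + eps * Delta)

finite_diff = (B_eps - B0) / eps      # numerical derivative of the inverse
analytic = -B0 @ Delta @ B0           # -H(0)^{-1} H'(0) H(0)^{-1}
gap = np.max(np.abs(finite_diff - analytic))
```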

We show another lemma which is useful to prove that the gross error 
sensitivity diverges to infinity. 

Lemma 14. Suppose $n \ge k + 3$ for non-negative integers $n$ and $k$. For any
set of vectors $s, y, y_1, \ldots, y_k \in \mathbb{R}^n$ such that $s^\top y > 0$ and any positive real
number $d$, there exists a sequence $\{B_i\}_{i=1}^{\infty} \subset \mathrm{PD}(n)$ satisfying the following
three conditions:

1. The equalities $B_i s = y$ and $B_i y_m = B_j y_m$ hold for all $i, j \ge 1$ and
$m = 1, \ldots, k$.

2. $\det(B_i) = d$ for all $i \ge 1$.

3. $\lim_{i \to \infty} \|B_i\|_F = \infty$.

Proof. For any $s, y \in \mathbb{R}^n$ such that $s^\top y > 0$ there exists $B \in \mathrm{PD}(n)$
satisfying $Bs = y$. Indeed, for the $n$ by $n$ identity matrix $I$, the matrix
$B = B^{BFGS}[I; s, y] \in \mathrm{PD}(n)$ is well-defined and satisfies $Bs = y$. When
$n \ge k + 3$ holds, there exist two unit vectors $p_1, p_2 \in \mathbb{R}^n$ satisfying $p_1^\top p_2 = 0$
and

\[ \begin{aligned} p_1^\top (B^{1/2} s) &= 0, & p_1^\top (B^{1/2} y_m) &= 0, & m &= 1, \ldots, k, \\ p_2^\top (B^{1/2} s) &= 0, & p_2^\top (B^{1/2} y_m) &= 0, & m &= 1, \ldots, k. \end{aligned} \]

We will show that the matrix

\[ B(a) = B^{1/2} (I + a p_1 p_1^\top + b p_2 p_2^\top) B^{1/2} \]

with

\[ a > 0, \quad b = \frac{d}{(1 + a) \det(B)} - 1 \tag{39} \]

satisfies four conditions: $B(a)s = y$, $B(a)y_m = By_m$, $\det B(a) = d$ and
$B(a) \in \mathrm{PD}(n)$ for all $a > 0$. The first two equalities are clear from the
definition of $p_1$, $p_2$ and $B$. The determinant of $B(a)$ is equal to

\[ \det(B(a)) = \det(B) \det(I + a p_1 p_1^\top + b p_2 p_2^\top) = \det(B)(1 + a)(1 + b) = d. \]

For any unit vector $x \in \mathbb{R}^n$ we have

\[ \begin{aligned} x^\top (I + a p_1 p_1^\top + b p_2 p_2^\top) x &= 1 + a (p_1^\top x)^2 + b (p_2^\top x)^2 \\ &\ge 1 + b (p_2^\top x)^2 && (\because a > 0) \\ &\ge 1 - (p_2^\top x)^2 && (\because b > -1) \\ &\ge 0 && \text{(Schwarz inequality)} \end{aligned} \]

and in addition the determinant of $(I + a p_1 p_1^\top + b p_2 p_2^\top)$ is equal to $d / \det(B) >
0$. Thus $B(a)$ is positive definite. Let $\lambda_1(a)$ be the maximum eigenvalue of
$B(a)$, and $x$ be a unit vector defined by $x = B^{-1/2} p_1 / \|B^{-1/2} p_1\|$. Then in
terms of the maximum eigenvalue of $B(a)$ we have

\[ \|B(a)\|_F \ge \lambda_1(a) \ge x^\top B x + \frac{a}{\|B^{-1/2} p_1\|^2}. \]

Then $\|B(a)\|_F$ tends to infinity when $a$ tends to infinity. Thus the sequence
defined by

\[ B_i = B(i), \quad i = 1, 2, 3, \ldots \tag{40} \]

satisfies the conditions of the lemma. □
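
The construction in this proof can be reproduced numerically. The sketch below assumes that the second coefficient is chosen so that the determinant equals the prescribed value, i.e. that one plus this coefficient equals $d / ((1+a)\det B)$; the helper names, the way the two orthonormal vectors are obtained (from a singular value decomposition) and the test vectors are our own illustrative choices, for dimension 5 with one extra vector.

```python
import numpy as np

def bfgs_update(B, s, y):
    """Standard BFGS update; used to get one matrix B with B s = y."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

def sym_sqrt(B):
    """Symmetric square root of a positive definite matrix."""
    w, Q = np.linalg.eigh(B)
    return (Q * np.sqrt(w)) @ Q.T

def lemma14_matrix(a, s, y, extra, d):
    """B(a) = R (I + a p1 p1^T + b p2 p2^T) R with R = B^(1/2) and det B(a) = d."""
    n = len(s)
    B = bfgs_update(np.eye(n), s, y)
    R = sym_sqrt(B)
    W = np.vstack([R @ s] + [R @ v for v in extra])
    _, _, Vt = np.linalg.svd(W)          # rows of Vt beyond rank(W) span the null space
    p1, p2 = Vt[-1], Vt[-2]              # orthonormal, orthogonal to all rows of W
    b = d / (np.linalg.det(B) * (1.0 + a)) - 1.0
    core = np.eye(n) + a * np.outer(p1, p1) + b * np.outer(p2, p2)
    return R @ core @ R, B

s = np.array([1.0, 0.0, 2.0, -1.0, 0.5])
y = np.array([2.0, 1.0, 1.0, 0.0, 1.0])    # s^T y = 4.5 > 0
y1 = np.array([0.0, 1.0, -1.0, 2.0, 0.0])
d = 3.0

results = [lemma14_matrix(a, s, y, [y1], d) for a in (1.0, 30.0, 1000.0)]
norms = [np.linalg.norm(Ba, "fro") for Ba, _ in results]
```

All three matrices satisfy the secant condition and have the same determinant, while their Frobenius norms grow with the first coefficient, which is the mechanism behind the divergence results of Theorems 7 to 10.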



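C.1 Proof of Theorem 7

Let $B(\varepsilon)$ be the optimal solution of (24). Under the inexact line search, the
influence function $\dot{B}(0)$ for V-BFGS-B is equal to $\Lambda[B_k; s, s, y, \bar{y}]$ which is
defined in Lemma 12. Thus we have

\[ \dot{B}(0) = \frac{(\bar{y} - y)^\top s}{s^\top y} \cdot \frac{\beta(\det B(0))}{1 - (n-1)\beta(\det B(0))} \Big( B(0) - \frac{y y^\top}{s^\top y} \Big) + \frac{y\bar{y}^\top + \bar{y}y^\top}{s^\top y} - \frac{(y + \bar{y})^\top s}{(s^\top y)^2} y y^\top. \tag{41} \]

If $(\bar{y} - y)^\top s = 0$ holds for any $\bar{y} \in \mathcal{Y}$, the potential does not affect the norm
of the influence function, because the first term of the above expression
vanishes. Thus, clearly $V(z) = -\log(z)$ is an optimal potential. Below we
assume $(\bar{y} - y)^\top s \ne 0$ for a vector $\bar{y} \in \mathcal{Y}$. Suppose that $B_k$ satisfies $B_k s = y$.
Then $B(0) = B_k$ holds, and the triangle inequality yields that

\[ \|\dot{B}(0)\|_F = \|\Lambda[B_k; s, s, y, \bar{y}]\|_F \ge \Big| \frac{(\bar{y} - y)^\top s}{s^\top y} \Big| \cdot \Big| \frac{\beta(\det B_k)}{1 - (n-1)\beta(\det B_k)} \Big| \Big( \|B_k\|_F - \Big\| \frac{y y^\top}{s^\top y} \Big\|_F \Big) - \Big\| \frac{y\bar{y}^\top + \bar{y}y^\top}{s^\top y} - \frac{(y + \bar{y})^\top s}{(s^\top y)^2} y y^\top \Big\|_F. \]

If $\beta(z)$ is not the null function, there exists $d > 0$ such that $\beta(d) \ne 0$. Lemma
14 with $k = 0$ implies that for $n \ge 3$ there exists a sequence $\{B_i\} \subset \mathrm{PD}(n)$
satisfying $B_i s = y$, $\det B_i = d$ for all $i$ and $\lim_{i \to \infty} \|B_i\|_F = \infty$. Hence

\[ \lim_{i \to \infty} \|\Lambda[B_i; s, s, y, \bar{y}]\|_F = \infty \]

holds, and then we obtain

\[ \sup\{ \|\Lambda[B_k; s, s, y, \bar{y}]\|_F \mid B_k \in \mathrm{PD}(n),\ \bar{y} \in \mathcal{Y} \} = \infty. \]

On the other hand, if $\beta(z) = 0$ for all $z > 0$, we obtain

\[ \max \|\Lambda[B_k; s, s, y, \bar{y}]\|_F = \max_{\bar{y} \in \mathcal{Y}} \Big\| \frac{y\bar{y}^\top + \bar{y}y^\top}{s^\top y} - \frac{(y + \bar{y})^\top s}{(s^\top y)^2} y y^\top \Big\|_F < \infty, \]

since $\mathcal{Y}$ is bounded. As a result, the potential $V$ such that $\beta_V = 0$
minimizes the gross error sensitivity. The condition $\beta_V = 0$ leads to $V(z) =
-\log(z)$ up to a constant factor.

C.2 Proof of Theorem 8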

Let $B(\varepsilon)$ be the optimal solution of (25). Under the inexact line search,
the influence function $\dot{B}(0)$ for V-DFP-B is equal to $\Gamma[B_k; s, s, y, \bar{y}]$ which is
defined in Lemma 13.

First, we study the case that $\beta(z)$ is not the null function. For the matrix
$B_k$ such that $B_k s = y$, we have $B(0) = B_k$. Using Lemma 13 for $B(0) = B_k$,
we have

\[ \dot{B}(0) = -B_k \Lambda[B_k^{-1}; y, \bar{y}, s, s] B_k = \frac{(\bar{y} - y)^\top s}{s^\top y} \cdot \frac{\beta(\det(B_k)^{-1})}{1 - (n-1)\beta(\det(B_k)^{-1})} \Big( B_k - \frac{y y^\top}{s^\top y} \Big) + \frac{y\bar{y}^\top + \bar{y}y^\top}{s^\top y} - \frac{(y + \bar{y})^\top s}{(s^\top y)^2} y y^\top, \]

in which the equality $B_k s = y$ is used. The above expression is almost the same
as (41) with $B(0) = B_k$, and thus the same proof works to obtain

\[ \sup\{ \|\dot{B}(0)\|_F \mid B_k \in \mathrm{PD}(n),\ \bar{y} \in \mathcal{Y} \} = \infty. \]

Next, we study the case that $\beta$ is the null function, that is, $\beta(z) = 0$.
Then, $V(z) = -\log(z)$ and $\nu(z) = 1$ hold. Let $B_k$ be a positive definite
matrix which does not necessarily satisfy $B_k s = y$. Then we obtain

\[ \dot{B}(0) = -B(0) \Lambda[B_k^{-1}; y, \bar{y}, s, s] B(0) = \frac{(\bar{y} - y)^\top s}{(s^\top y)^2} y y^\top + \frac{B(0) B_k^{-1} (y\bar{y}^\top + \bar{y}y^\top) B_k^{-1} B(0)}{y^\top B_k^{-1} y} - \frac{2 \bar{y}^\top B_k^{-1} y}{(y^\top B_k^{-1} y)^2} B(0) B_k^{-1} y y^\top B_k^{-1} B(0), \]

in which we used $B(0)s = y$. For $\beta = 0$, the updated matrix $B(0)$ is equal
to $B^{DFP}[B_k; s, y]$ and thus, we have

\[ B(0) B_k^{-1} = I - \frac{B_k s y^\top B_k^{-1} + y s^\top}{s^\top y} + \frac{s^\top B_k s}{(s^\top y)^2} y y^\top B_k^{-1} + \frac{1}{s^\top y} y y^\top B_k^{-1}. \tag{42} \]

Let $B \in \mathrm{PD}(n)$ and $c$ be a positive real number, and define $t = Bs$; then
for $B_k = cB$ some calculation yields

\[ \dot{B}(0) = -B(0) \Lambda[(cB)^{-1}; y, \bar{y}, s, s] B(0) = \frac{y\bar{y}^\top + \bar{y}y^\top}{s^\top y} - \frac{(y + \bar{y})^\top s}{(s^\top y)^2} y y^\top - \frac{c}{s^\top y} Z, \]

where $Z$ is defined by

\[ Z = \Big( t - \frac{s^\top t}{s^\top y} y \Big) \Big( \bar{y} - \frac{s^\top \bar{y}}{s^\top y} y \Big)^\top + \Big( \bar{y} - \frac{s^\top \bar{y}}{s^\top y} y \Big) \Big( t - \frac{s^\top t}{s^\top y} y \Big)^\top. \]

Since $\mathcal{Y}$ contains an open subset, there exists a vector $\bar{y} \in \mathcal{Y}$ which is linearly
independent of $y$. Clearly there exists $B \in \mathrm{PD}(n)$ such that the three vectors
$t = Bs$, $y$ and $\bar{y}$ are linearly independent. For such a choice, $Z$ is not the null
matrix, and the equality

\[ \lim_{c \to \infty} \|B(0) \Lambda[(cB)^{-1}; y, \bar{y}, s, s] B(0)\|_F = \infty \]

holds. As a result, even for the standard DFP formula, we have

\[ \sup\{ \|\dot{B}(0)\|_F \mid B_k \in \mathrm{PD}(n),\ \bar{y} \in \mathcal{Y} \} = \infty. \]

In summary, for all V-DFP updates for the Hessian approximation, the gross
error sensitivity defined in Theorem 8 is equal to infinity.

C.3 Proof of Theorem 9

Let $H(\varepsilon)$ be the optimal solution of (26). Under the inexact line search, the
influence function $\dot{H}(0)$ for V-BFGS-H is equal to $\Gamma[H_k; y, \bar{y}, s, s]$ which is
defined in Lemma 13.

First, we study the case that $\beta(z)$ is not the null function. Suppose
$\beta(d) \ne 0$. If $H_k$ satisfies $H_k y = s$, then we have $H_k = H(0)$. Using Lemma
12 and Lemma 13 for the matrix $H_k$ such that $H_k y = s$, we obtain

\[ \dot{H}(0) = -H_k \Lambda[H_k^{-1}; s, s, y, \bar{y}] H_k = -\frac{(\bar{y} - y)^\top s}{s^\top y} \cdot \frac{\beta(\det(H_k)^{-1})}{1 - (n-1)\beta(\det(H_k)^{-1})} \Big( H_k - \frac{s s^\top}{s^\top y} \Big) - \frac{H_k \bar{y} s^\top + s \bar{y}^\top H_k}{s^\top y} + \frac{(y + \bar{y})^\top s}{(s^\top y)^2} s s^\top. \tag{43} \]

Lemma 14 with $k = 1$ implies that for $n \ge 4$ there exists a sequence $\{H_i\} \subset
\mathrm{PD}(n)$ satisfying the following conditions: $H_i y = s$ and $(\det H_i)^{-1} = d$ for
all $i \ge 1$; $H_i \bar{y} = H_j \bar{y}$ for all $i, j \ge 1$; $\lim_{i \to \infty} \|H_i\|_F = \infty$. We define $t = H_i \bar{y}$,
which does not depend on $i$. Then for $H_k = H_i$ we have

\[ \|\dot{H}(0)\|_F = \|H_i \Lambda[H_i^{-1}; s, s, y, \bar{y}] H_i\|_F \ge \Big| \frac{(\bar{y} - y)^\top s}{s^\top y} \Big| \cdot \Big| \frac{\beta(d)}{1 - (n-1)\beta(d)} \Big| \Big( \|H_i\|_F - \Big\| \frac{s s^\top}{s^\top y} \Big\|_F \Big) - \Big\| \frac{t s^\top + s t^\top}{s^\top y} - \frac{(y + \bar{y})^\top s}{(s^\top y)^2} s s^\top \Big\|_F. \]

Hence the equality

\[ \lim_{i \to \infty} \|H_i \Lambda[H_i^{-1}; s, s, y, \bar{y}] H_i\|_F = \infty \]

holds, and thus we obtain

\[ \sup\{ \|\dot{H}(0)\|_F \mid H_k \in \mathrm{PD}(n),\ \bar{y} \in \mathcal{Y} \} = \infty. \]

Next, we study the case that $\beta$ is the null function, that is, $\beta(z) = 0$.
Then, $V(z) = -\log(z)$ and $\nu(z) = 1$ hold. For $H_k$ such that $H_k y = s$, we
have

\[ \dot{H}(0) = -\frac{H_k \bar{y} s^\top + s \bar{y}^\top H_k}{s^\top y} + \frac{(y + \bar{y})^\top s}{(s^\top y)^2} s s^\top. \tag{44} \]

Let $H_0 \in \mathrm{PD}(n)$ be a matrix satisfying $H_0 y = s$. Let $p_1 \in \mathbb{R}^n$ and $\bar{y} \in \mathcal{Y}$ be
vectors satisfying $p_1^\top H_0^{1/2} y = 0$ and $p_1^\top H_0^{1/2} \bar{y} \ne 0$. For $n \ge 4$, the existence
of $p_1$ and $\bar{y}$ is guaranteed by the assumption on $\mathcal{Y}$. Indeed, there exists
$\bar{y} \in \mathcal{Y}$ such that $y$ and $\bar{y}$ are linearly independent. We now define the
matrix $H_i \in \mathrm{PD}(n)$ by

\[ H_i = H_0^{1/2} (I + i \cdot p_1 p_1^\top) H_0^{1/2}, \quad i = 0, 1, 2, \ldots \]

Then we have

\[ H_i y = s, \quad H_i \bar{y} = z + i \cdot u, \]

where $z = H_0 \bar{y}$ and $u = (p_1^\top H_0^{1/2} \bar{y}) H_0^{1/2} p_1 \ne 0$. Substituting $H_k = H_i$ into
(44) we obtain

\[ \dot{H}(0) = -i \cdot \frac{u s^\top + s u^\top}{s^\top y} + \frac{(y + \bar{y})^\top s}{(s^\top y)^2} s s^\top - \frac{z s^\top + s z^\top}{s^\top y}. \]

This implies that

\[ \lim_{i \to \infty} \|H_i \Lambda[H_i^{-1}; s, s, y, \bar{y}] H_i\|_F = \infty \]

for $\beta = 0$. Hence we obtain

\[ \sup\{ \|\dot{H}(0)\|_F \mid H_k \in \mathrm{PD}(n),\ \bar{y} \in \mathcal{Y} \} = \infty \]

even for the standard BFGS update of the inverse Hessian approximation.

C.4 Proof of Theorem 10



Let $H(\varepsilon)$ be the optimal solution of (27). Under the inexact line search,
the influence function $\dot{H}(0)$ for V-DFP-H is equal to $\Lambda[H_k; y, \bar{y}, s, s]$ which
is defined in Lemma 12.

First, we study the case that $\beta(z)$ is not the null function. Suppose
$\beta(d) \ne 0$ for $d > 0$. If $H_k$ satisfies $H_k y = s$, we have $H_k = H(0)$. Using
Lemma 12 for the matrix $H_k$ such that $H_k y = s$, we obtain

\[ \dot{H}(0) = \Lambda[H_k; y, \bar{y}, s, s] = -\frac{(\bar{y} - y)^\top s}{s^\top y} \cdot \frac{\beta(\det H_k)}{1 - (n-1)\beta(\det H_k)} \Big( H_k - \frac{s s^\top}{s^\top y} \Big) - \frac{H_k \bar{y} s^\top + s \bar{y}^\top H_k}{s^\top y} + \frac{(y + \bar{y})^\top s}{(s^\top y)^2} s s^\top. \]

The above expression is almost the same as (43), and thus the same proof
remains valid to obtain

\[ \sup\{ \|\dot{H}(0)\|_F \mid H_k \in \mathrm{PD}(n),\ \bar{y} \in \mathcal{Y} \} = \infty. \]

Next, we consider the case that $\beta$ is the null function. Then $V(z) =
-\log(z)$ and $\nu(z) = 1$ hold. For $H_k$ such that $H_k y = s$, we have

\[ \dot{H}(0) = \Lambda[H_k; y, \bar{y}, s, s] = -\frac{H_k \bar{y} s^\top + s \bar{y}^\top H_k}{s^\top y} + \frac{(y + \bar{y})^\top s}{(s^\top y)^2} s s^\top. \]

This is the same as the influence function of (44), and thus, we obtain

\[ \sup\{ \|\dot{H}(0)\|_F \mid H_k \in \mathrm{PD}(n),\ \bar{y} \in \mathcal{Y} \} = \infty. \]

Table 2: Approximate influence function for V-BFGS update and V-DFP
update is shown. The power potential $V(z) = (1 - z^{\gamma})/\gamma$ is used for
V-extended quasi-Newton methods, where $\gamma = 0$ corresponds to the BFGS or
DFP method.

V-BFGS-B
B_k         diag(1,...,n)/(n!)^{1/n}         diag(1,...,n)                    I + n^3 pp^T
γ           -2        -1        0            -2        -1        0            -2        -1        0
n = 10      9.5e+00   9.5e+00   9.5e+00      1.5e+01   9.7e+00   9.5e+00      2.0e+02   1.0e+02   5.0e+01
n = 100     2.7e+01   2.7e+01   2.7e+01      2.3e+02   2.8e+01   2.7e+01      1.1e+04   1.0e+04   8.7e+03
n = 500     9.3e+01   9.3e+01   9.3e+01      2.8e+03   9.6e+01   9.3e+01      2.6e+05   2.5e+05   2.4e+05
n = 1000    1.0e+02   1.0e+02   1.0e+02      7.4e+03   1.1e+02   1.0e+02      1.0e+06   9.9e+05   9.7e+05

V-DFP-B
B_k         diag(1,...,n)/(n!)^{1/n}         diag(1,...,n)                    I + n^3 pp^T
γ           -2        -1        0            -2        -1        0            -2        -1        0
n = 10      1.3e+02   1.3e+02   1.3e+02      2.9e+03   6.5e+02   1.5e+02      2.0e+02   1.0e+02   5.0e+01
n = 100     1.7e+03   1.7e+03   1.7e+03      2.5e+06   6.5e+04   1.7e+03      1.1e+04   1.0e+04   8.7e+03
n = 500     4.6e+04   4.6e+04   4.6e+04      1.6e+09   8.7e+06   4.7e+04      2.6e+05   2.5e+05   2.4e+05
n = 1000    3.0e+04   3.0e+04   3.0e+04      4.1e+09   1.1e+07   3.0e+04      1.0e+06   9.9e+05   9.7e+05

V-BFGS-H
H_k         diag(1,...,n)/(n!)^{1/n}         diag(1,...,n)                    I + n^3 pp^T
γ           -2        -1        0            -2        -1        0            -2        -1        0
n = 10      2.1e+02   2.1e+02   2.1e+02      4.8e+03   1.1e+03   2.4e+02      2.2e+02   1.1e+02   5.6e+01
n = 100     1.1e+03   1.1e+03   1.1e+03      1.6e+06   4.1e+04   1.1e+03      2.0e+04   1.7e+04   1.5e+04
n = 500     8.2e+04   8.2e+04   8.2e+04      2.8e+09   1.5e+07   8.3e+04      8.7e+05   8.4e+05   8.1e+05
n = 1000    2.6e+04   2.6e+04   2.6e+04      3.6e+09   9.8e+06   2.7e+04      4.7e+06   4.6e+06   4.5e+06


1/-DFP-H 


Hk 

7 


diag(l,...,n)/(n!) 1 /' 1 
-2 -1 


diag(l,...,n) 
-2 -1 


/ + n 3 pp T 
-2 -1 


n= 10 
n = 100 
n = 500 
n = 1000 


1.0e+01 1.0e+01 1.0e+01 
2.1e+01 2.1e+01 2.1e+01 
9.9e+01 9.9e+01 9.9e+01 
1.2e+02 1.2e+02 1.2e+02 


1.7e+01 l.le+01 1.0e+01 
4.5e+02 2.5e+01 2.1e+01 
9.5e+03 1.2e+02 9.9e+01 
3.6e+04 1.7e+02 1.2e+02 


2.5e+02 1.3e+02 6.4e+01 
4.1e+06 3.6e+06 3.1e+06 
1.4e+09 1.4e+09 1.3e+09 
1.2e+10 1.2e+10 1.2e+10 
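The statement that $\gamma = 0$ corresponds to the standard BFGS or DFP method can be read as a limit: as $\gamma \to 0$, the power potential $(1 - z^\gamma)/\gamma$ tends to $-\log z$, the potential associated with the standard updates. The short sketch below checks this limit numerically at one illustrative point; the function name and the test value of $z$ are hypothetical choices.

```python
import math

def power_potential(z, gamma):
    """V(z) = (1 - z**gamma) / gamma, with the gamma = 0 case taken as -log(z)."""
    if gamma == 0.0:
        return -math.log(z)
    return (1.0 - z ** gamma) / gamma

z = 2.5  # arbitrary positive test point
# Distance between the power potential and -log(z) as gamma approaches 0.
gaps = [abs(power_potential(z, g) - power_potential(z, 0.0))
        for g in (-1.0, -0.1, -0.001)]
print(gaps)
```

The gap shrinks as $\gamma$ approaches zero, which is why the $\gamma = 0$ columns of Table 2 are identified with the standard BFGS and DFP formulae.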
Table 3: Number of iterations of BFGS and DFP under inexact line search. The value $h$ denotes the intensity of the noise involved in the line search.

               |     n = 100      |     n = 500      |     n = 1000
Problem    h   |  BFGS     DFP    |  BFGS     DFP    |  BFGS     DFP
Problem 1  0.0 |  100.4    110.6  |  434.6    577.8  |  682.1   1788.5
           0.1 |  102.9    166.2  |  430.6   1165.2  |  680.9   2628.9
           0.2 |  104.5    198.6  |  443.6   1361.8  |  685.1   3099.2
           0.3 |  106.0    223.0  |  444.2   1501.6  |  687.6   3365.9
Problem 2  0.0 |  100.9    111.6  |  428.5    585.7  |  661.5   2489.8
           0.1 |  102.8    153.5  |  443.5   1237.4  |  672.4   2762.1
           0.2 |  104.4    177.7  |  438.3   1419.6  |  682.7   3301.2
           0.3 |  106.1    199.4  |  454.0   1592.8  |  694.0   3730.8